[00:01:54] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:04:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P32976 and previous config saved to /var/cache/conftool/dbconfig/20220825-000443-ladsgroup.json
[00:05:12] <icinga-wm>	 PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:16] <icinga-wm>	 PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:06] <icinga-wm>	 PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:06:30] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:19:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T314041)', diff saved to https://phabricator.wikimedia.org/P32977 and previous config saved to /var/cache/conftool/dbconfig/20220825-001949-ladsgroup.json
[00:19:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[00:19:55] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[00:20:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2098.codfw.wmnet with reason: Maintenance
[00:20:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[00:21:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
[00:21:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32978 and previous config saved to /var/cache/conftool/dbconfig/20220825-002120-ladsgroup.json
[00:23:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32979 and previous config saved to /var/cache/conftool/dbconfig/20220825-002306-ladsgroup.json
[00:29:05] <wikibugs>	 SRE, SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (CDunn) Approved
[00:32:38] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.265 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:34:52] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:38:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P32980 and previous config saved to /var/cache/conftool/dbconfig/20220825-003812-ladsgroup.json
[00:42:58] <icinga-wm>	 RECOVERY - Check systemd state on logstash1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:43:08] <icinga-wm>	 RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:44:28] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:46:46] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift
[00:53:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P32981 and previous config saved to /var/cache/conftool/dbconfig/20220825-005318-ladsgroup.json
[01:08:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T314041)', diff saved to https://phabricator.wikimedia.org/P32982 and previous config saved to /var/cache/conftool/dbconfig/20220825-010824-ladsgroup.json
[01:08:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[01:08:30] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:08:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
[01:08:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32983 and previous config saved to /var/cache/conftool/dbconfig/20220825-010845-ladsgroup.json
[01:10:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32984 and previous config saved to /var/cache/conftool/dbconfig/20220825-011032-ladsgroup.json
[01:25:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P32985 and previous config saved to /var/cache/conftool/dbconfig/20220825-012538-ladsgroup.json
[01:27:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:36:45] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job redis_gitlab in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:40:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P32986 and previous config saved to /var/cache/conftool/dbconfig/20220825-014044-ladsgroup.json
[01:41:45] <jinxer-wm>	 (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:46:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:51:45] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:55:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T314041)', diff saved to https://phabricator.wikimedia.org/P32987 and previous config saved to /var/cache/conftool/dbconfig/20220825-015550-ladsgroup.json
[01:55:53] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:55:56] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[01:56:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
[01:56:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32988 and previous config saved to /var/cache/conftool/dbconfig/20220825-015612-ladsgroup.json
[01:58:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32989 and previous config saved to /var/cache/conftool/dbconfig/20220825-015800-ladsgroup.json
[02:00:02] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:12] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: package_builder_Clean_up_build_directory.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:11:45] <jinxer-wm>	 (JobUnavailable) resolved: (5) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P32990 and previous config saved to /var/cache/conftool/dbconfig/20220825-021306-ladsgroup.json
[02:21:12] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.207 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:25:50] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.024 second response time https://wikitech.wikimedia.org/wiki/Swift
[02:28:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P32991 and previous config saved to /var/cache/conftool/dbconfig/20220825-022812-ladsgroup.json
[02:43:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32992 and previous config saved to /var/cache/conftool/dbconfig/20220825-024318-ladsgroup.json
[02:43:20] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[02:43:24] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[02:43:34] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance
[02:43:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32993 and previous config saved to /var/cache/conftool/dbconfig/20220825-024339-ladsgroup.json
[02:45:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32994 and previous config saved to /var/cache/conftool/dbconfig/20220825-024527-ladsgroup.json
[02:56:54] <icinga-wm>	 PROBLEM - Swift https backend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.286 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:00:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P32995 and previous config saved to /var/cache/conftool/dbconfig/20220825-030033-ladsgroup.json
[03:01:20] <icinga-wm>	 RECOVERY - Swift https backend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.031 second response time https://wikitech.wikimedia.org/wiki/Swift
[03:09:04] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318', diff saved to https://phabricator.wikimedia.org/P32996 and previous config saved to /var/cache/conftool/dbconfig/20220825-031539-ladsgroup.json
[03:16:10] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:23:10] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:25:33] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:24] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:12] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:30:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2168:3318 (T314041)', diff saved to https://phabricator.wikimedia.org/P32997 and previous config saved to /var/cache/conftool/dbconfig/20220825-033045-ladsgroup.json
[03:30:47] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[03:30:51] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[03:31:01] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
[03:31:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P32998 and previous config saved to /var/cache/conftool/dbconfig/20220825-033107-ladsgroup.json
[03:32:10] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:32:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P32999 and previous config saved to /var/cache/conftool/dbconfig/20220825-033253-ladsgroup.json
[03:41:43] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:42:10] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:28] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:46:54] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33000 and previous config saved to /var/cache/conftool/dbconfig/20220825-034759-ladsgroup.json
[04:03:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P33001 and previous config saved to /var/cache/conftool/dbconfig/20220825-040306-ladsgroup.json
[04:08:06] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:15:16] <icinga-wm>	 PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:18:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T314041)', diff saved to https://phabricator.wikimedia.org/P33002 and previous config saved to /var/cache/conftool/dbconfig/20220825-041812-ladsgroup.json
[04:18:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[04:18:17] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[04:18:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
[04:18:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33003 and previous config saved to /var/cache/conftool/dbconfig/20220825-041833-ladsgroup.json
[04:20:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33004 and previous config saved to /var/cache/conftool/dbconfig/20220825-042020-ladsgroup.json
[04:23:55] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) That's ok from my side
[04:25:56] <wikibugs>	 SRE, ops-eqiad, DBA, DC-Ops: Q4:(Need By: TBD) rack/setup/install db1196.eqiad.wmnet - db1203.eqiad.wmnet - https://phabricator.wikimedia.org/T306848 (Marostegui) Please note that the last hostnames should be: db1201 db1202 db1203
[04:35:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33005 and previous config saved to /var/cache/conftool/dbconfig/20220825-043527-ladsgroup.json
[04:41:10] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) (owner: Cwhite)
[04:50:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P33006 and previous config saved to /var/cache/conftool/dbconfig/20220825-045033-ladsgroup.json
[05:05:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T314041)', diff saved to https://phabricator.wikimedia.org/P33007 and previous config saved to /var/cache/conftool/dbconfig/20220825-050539-ladsgroup.json
[05:05:45] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[05:06:52] <wikibugs>	 (PS1) Marostegui: db1186: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826416 (https://phabricator.wikimedia.org/T313569)
[05:07:14] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1130', diff saved to https://phabricator.wikimedia.org/P33008 and previous config saved to /var/cache/conftool/dbconfig/20220825-050713-root.json
[05:08:18] <wikibugs>	 (CR) Marostegui: [C: +2] db1186: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826416 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:09:50] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1186 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826417 (https://phabricator.wikimedia.org/T313569)
[05:10:30] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1186 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826417 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:11:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1186 to dbctl', diff saved to https://phabricator.wikimedia.org/P33010 and previous config saved to /var/cache/conftool/dbconfig/20220825-051130-marostegui.json
[05:11:56] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1186 with minimal weight in s1 T313569', diff saved to https://phabricator.wikimedia.org/P33011 and previous config saved to /var/cache/conftool/dbconfig/20220825-051155-root.json
[05:12:00] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:13:52] <wikibugs>	 (PS1) Marostegui: db1188: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826418 (https://phabricator.wikimedia.org/T313569)
[05:14:32] <wikibugs>	 (CR) Marostegui: [C: +2] db1188: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826418 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:15:41] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1188 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826419 (https://phabricator.wikimedia.org/T313569)
[05:16:19] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1188 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826419 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:17:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1188 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33012 and previous config saved to /var/cache/conftool/dbconfig/20220825-051737-marostegui.json
[05:17:42] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:17:55] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1188 with minimal weight in s2 T313569', diff saved to https://phabricator.wikimedia.org/P33013 and previous config saved to /var/cache/conftool/dbconfig/20220825-051754-root.json
[05:18:43] <wikibugs>	 (PS1) Marostegui: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334
[05:18:52] <wikibugs>	 (PS1) Marostegui: Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - https://gerrit.wikimedia.org/r/826335
[05:19:08] <wikibugs>	 (PS1) Marostegui: Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336
[05:19:17] <wikibugs>	 (PS2) Marostegui: Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336
[05:22:32] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "pc1014: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826336 (owner: Marostegui)
[05:23:22] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Aklapper)
[05:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 33 hosts with reason: Primary switchover s4 T315419
[05:23:48] <stashbot>	 T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419
[05:23:57] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to > 2.44.10 - https://phabricator.wikimedia.org/T265549 (Aklapper)
[05:23:59] <wikibugs>	 SRE, Thumbor, Wikimedia-SVG-rendering, Upstream: Update librsvg to ≥2.42.3 (2.44.10) - https://phabricator.wikimedia.org/T193352 (Aklapper)
[05:24:06] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 33 hosts with reason: Primary switchover s4 T315419
[05:24:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1160 with weight 0 T315419', diff saved to https://phabricator.wikimedia.org/P33015 and previous config saved to /var/cache/conftool/dbconfig/20220825-052415-ladsgroup.json
[05:25:26] <wikibugs>	 (CR) Ladsgroup: [C: +2] Display page namespace with spaces instead of underscores when page doesn't exist [core] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) (owner: Ladsgroup)
[05:25:45] <wikibugs>	 (CR) Ladsgroup: [C: +1] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334 (owner: Marostegui)
[05:26:38] <wikibugs>	 (PS1) Marostegui: db1190: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/826420
[05:29:13] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: Cwhite)
[05:29:19] <wikibugs>	 (CR) Marostegui: [C: +2] db1190: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/826420 (owner: Marostegui)
[05:30:45] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1190 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826421 (https://phabricator.wikimedia.org/T313569)
[05:32:05] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1190 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826421 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:32:53] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1190 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33016 and previous config saved to /var/cache/conftool/dbconfig/20220825-053253-marostegui.json
[05:32:58] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:33:11] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1190 with minimal weight in s4 T313569', diff saved to https://phabricator.wikimedia.org/P33017 and previous config saved to /var/cache/conftool/dbconfig/20220825-053310-root.json
[05:33:19] <wikibugs>	 (CR) Andrea Denisse: "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[05:33:24] <wikibugs>	 (CR) Andrea Denisse: [C: +1] logstash: use rsyslog-namespaced fields [puppet] - https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[05:34:03] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: Cwhite)
[05:34:30] <wikibugs>	 (CR) Andrea Denisse: [C: +1] "LGTM, thank you!" [puppet] - https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: Cwhite)
[05:35:59] <wikibugs>	 (PS1) Marostegui: db1191: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826422 (https://phabricator.wikimedia.org/T313569)
[05:37:11] <wikibugs>	 (CR) Marostegui: [C: +2] db1191: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/826422 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:40:29] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1185 [puppet] - https://gerrit.wikimedia.org/r/826423 (https://phabricator.wikimedia.org/T313569)
[05:41:18] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1185 [puppet] - https://gerrit.wikimedia.org/r/826423 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:41:36] <wikibugs>	 (Merged) jenkins-bot: Display page namespace with spaces instead of underscores when page doesn't exist [core] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826332 (https://phabricator.wikimedia.org/T316092) (owner: Ladsgroup)
[05:43:27] <wikibugs>	 (PS1) Marostegui: site.pp: Remove insetup from db1185 [puppet] - https://gerrit.wikimedia.org/r/826424 (https://phabricator.wikimedia.org/T313569)
[05:44:17] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove insetup from db1185 [puppet] - https://gerrit.wikimedia.org/r/826424 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:45:29] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[05:46:08] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[05:46:08] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized php-1.39.0-wmf.26/includes/page/Article.php: Backport: [[gerrit:826332|Display page namespace with spaces instead of underscores when page doesn't exist (T316092)]] (duration: 03m 32s)
[05:46:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[05:46:14] <stashbot>	 T316092: Underscore displayed in namespace prefix for non-existent pages (e.g. "User_talk") - https://phabricator.wikimedia.org/T316092
[05:46:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[05:48:10] <wikibugs>	 (PS1) Marostegui: instances.yaml: Add db1191 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826425 (https://phabricator.wikimedia.org/T313569)
[05:49:46] <wikibugs>	 (CR) Marostegui: [C: +2] instances.yaml: Add db1191 to dbctl [puppet] - https://gerrit.wikimedia.org/r/826425 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[05:50:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db1191 to dbctl T313569', diff saved to https://phabricator.wikimedia.org/P33018 and previous config saved to /var/cache/conftool/dbconfig/20220825-055038-marostegui.json
[05:50:43] <stashbot>	 T313569: Productionize db1185.eqiad.wmnet - db1195.eqiad.wmnet - https://phabricator.wikimedia.org/T313569
[05:50:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Pool db1191 with minimal weight in s7 T313569', diff saved to https://phabricator.wikimedia.org/P33019 and previous config saved to /var/cache/conftool/dbconfig/20220825-055057-root.json
[05:58:43] <Amir1>	 sigh, I haven't moved anything and it's stuck on the 10.6 replica for a full half an hour now
[05:59:15] <marostegui>	 what was the timeout?
[05:59:20] <Amir1>	 25
[05:59:32] <Amir1>	 I'm fairly certain it passed 25 minutes
[05:59:48] <marostegui>	 yeah, the problem is that that host is so stuck that even the kills aren't working
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: My dear minions, it's time we take the moon! Just kidding. Time for Primary database switchover deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0600).
[06:00:08] <Amir1>	 ah it timed out now
[06:00:08] <marostegui>	 so did it go through now?
[06:00:10] <marostegui>	 yeah
[06:00:11] <marostegui>	 I forced it
[06:00:12] <marostegui>	 he
[06:00:27] <Amir1>	 now we need to do the rest. Should I re-run it?
[06:00:41] <marostegui>	 yes, but I wonder if it will attempt to go for db1143 again
[06:00:43] <wikibugs>	 (PS1) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[06:00:51] <Amir1>	 :(
[06:01:01] <marostegui>	 let me try one thing
[06:01:10] <marostegui>	 what's the new master, db1160?
[06:01:13] <Amir1>	 yup
[06:01:16] <wikibugs>	 (PS2) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[06:02:34] <marostegui>	 Amir1: ok, so use db-move-replica with each instance, so you can leave db1143 aside. I just ran this: db-move-replica --timeout 25 db1141 db1160 and it worked 
[06:02:41] <marostegui>	 you can continue with all the other hosts
[06:02:52] <Amir1>	 I see 
[06:02:53] <Amir1>	 ok
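The per-host approach marostegui describes above (move each replica individually so the stuck host can be left aside) can be sketched as a small script. This is only an illustration under stated assumptions: the command name, the `--timeout 25` flag, and the hosts db1141, db1143 and db1160 come from the log, while the helper name and the replica list are hypothetical. The script prints command lines and does not run anything.

```python
import shlex


def build_move_commands(replicas, new_master, skip=(), timeout=25):
    """Return one db-move-replica command line per replica.

    Hosts listed in `skip`, and the new master itself, are left out.
    The helper name and its parameters are hypothetical.
    """
    commands = []
    for host in replicas:
        if host in skip or host == new_master:
            continue
        argv = ["db-move-replica", "--timeout", str(timeout), host, new_master]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    # Hypothetical replica list for illustration only.
    replicas = ["db1141", "db1143", "db1160"]
    for line in build_move_commands(replicas, "db1160", skip={"db1143"}):
        print(line)
```

Printing the commands instead of executing them keeps the operator in the loop, which matches how the hosts were handled one by one in the conversation.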
[06:03:00] <wikibugs>	 (PS3) Andrea Denisse: librenms: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[06:03:49] <Amir1>	 marostegui: otoh, db1141 is lagging behind (like db1147)
[06:03:57] <Amir1>	 semi sync again?
[06:04:26] <marostegui>	 yeah
[06:04:28] <marostegui>	 just fixed it
[06:04:34] <Amir1>	 thanks
[06:04:39] <marostegui>	 I think db-switchover does disable it before every move
[06:06:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1114', diff saved to https://phabricator.wikimedia.org/P33020 and previous config saved to /var/cache/conftool/dbconfig/20220825-060601-root.json
[06:08:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 5%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33022 and previous config saved to /var/cache/conftool/dbconfig/20220825-060816-root.json
[06:12:16] <wikibugs>	 (PS1) Andrea Denisse: librenms: Reserve id for the LibreNMS user; Use systemd::sysuser instead of user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388)
[06:13:12] <wikibugs>	 (CR) CI reject: [V: -1] librenms: Reserve id for the LibreNMS user; Use systemd::sysuser instead of user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse)
[06:14:49] <wikibugs>	 (PS4) Andrea Denisse: netmon: Reserve UID/GID for the LibreNMS system user. [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388)
[06:16:21] <wikibugs>	 (PS2) Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388)
[06:21:46] <wikibugs>	 (PS2) Ladsgroup: mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/824147 (https://phabricator.wikimedia.org/T315419) (owner: Gerrit maintenance bot)
[06:21:51] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] mariadb: Promote db1160 to s4 master [puppet] - https://gerrit.wikimedia.org/r/824147 (https://phabricator.wikimedia.org/T315419) (owner: Gerrit maintenance bot)
[06:22:22] <wikibugs>	 (PS1) Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388)
[06:22:37] <wikibugs>	 (Abandoned) Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - https://gerrit.wikimedia.org/r/826429 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse)
[06:22:39] <Amir1>	 !log Starting s4 eqiad failover from db1138 to db1160 - T315419
[06:22:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:22:44] <stashbot>	 T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419
[06:23:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 10%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33023 and previous config saved to /var/cache/conftool/dbconfig/20220825-062321-root.json
[06:23:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s4 eqiad as read-only for maintenance - T315419', diff saved to https://phabricator.wikimedia.org/P33024 and previous config saved to /var/cache/conftool/dbconfig/20220825-062353-ladsgroup.json
[06:24:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1160 to s4 primary and set section read-write T315419', diff saved to https://phabricator.wikimedia.org/P33025 and previous config saved to /var/cache/conftool/dbconfig/20220825-062425-ladsgroup.json
[06:26:00] <wikibugs>	 (CR) Muehlenhoff: "Did you capture the error, what was failing specifically?" [puppet] - https://gerrit.wikimedia.org/r/826331 (owner: Dzahn)
[06:26:45] <wikibugs>	 (PS2) Ladsgroup: wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/824148 (https://phabricator.wikimedia.org/T315419) (owner: Gerrit maintenance bot)
[06:26:50] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] wmnet: Update s4-master alias [dns] - https://gerrit.wikimedia.org/r/824148 (https://phabricator.wikimedia.org/T315419) (owner: Gerrit maintenance bot)
[06:26:55] <wikibugs>	 (PS2) Andrea Denisse: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. [puppet] - https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388)
[06:28:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1138 T315419', diff saved to https://phabricator.wikimedia.org/P33026 and previous config saved to /var/cache/conftool/dbconfig/20220825-062852-ladsgroup.json
[06:28:57] <stashbot>	 T315419: Switchover s4 master (db1138 -> db1160) - https://phabricator.wikimedia.org/T315419
[06:29:16] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/pcc-worker1001/36966/" [puppet] - https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse)
[06:30:22] <icinga-wm>	 RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:32:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maint on s4 old master
[06:32:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maint on s4 old master
[06:34:10] <wikibugs>	 (PS1) Andrea Denisse: doc: Fix smalll typos in the systemd::sysuser documentation. [puppet] - https://gerrit.wikimedia.org/r/826490
[06:34:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[06:34:36] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1138.eqiad.wmnet with reason: Maintenance
[06:34:38] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[06:35:00] <wikibugs>	 (CR) Muehlenhoff: "Isn't profile::mediawiki::common a more logical choice? I think we also want this on the snapshot* hosts as well, having dumps complete fa" [puppet] - https://gerrit.wikimedia.org/r/826405 (https://phabricator.wikimedia.org/T315398) (owner: Tim Starling)
[06:35:02] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db1143.eqiad.wmnet with reason: Maintenance
[06:37:26] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.138 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:38:26] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 25%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33027 and previous config saved to /var/cache/conftool/dbconfig/20220825-063826-root.json
[06:38:38] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[06:39:50] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:42:51] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, netops, and 2 others: LibreNMS seemingly not collecting data for many ports after migration to netmon1003 - https://phabricator.wikimedia.org/T314972 (andrea.denisse)
[06:43:35] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (andrea.denisse) Open→In progress
[06:46:10] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:31] <wikibugs>	 SRE, Infrastructure-Foundations, Observability-Metrics, Patch-For-Review: Define a fleetwide uid and gid mappings for the Netmon instances containing LibreNMS and Rancid. - https://phabricator.wikimedia.org/T315388 (andrea.denisse)
[06:48:52] <wikibugs>	 (PS2) Slyngshede: c:dynamicproxy clean up after cronjob removal. [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673)
[06:49:14] <wikibugs>	 (CR) CI reject: [V: -1] c:dynamicproxy clean up after cronjob removal. [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:50:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[06:50:57] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
[06:50:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[06:51:23] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2094.codfw.wmnet with reason: Maintenance
[06:51:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33028 and previous config saved to /var/cache/conftool/dbconfig/20220825-065128-ladsgroup.json
[06:51:33] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[06:51:42] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:52:57] <wikibugs>	 (PS3) Slyngshede: c:dynamicproxy clean up after cronjob removal. [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673)
[06:53:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33029 and previous config saved to /var/cache/conftool/dbconfig/20220825-065315-ladsgroup.json
[06:53:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33030 and previous config saved to /var/cache/conftool/dbconfig/20220825-065331-root.json
[06:56:20] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:56:47] <wikibugs>	 (CR) Slyngshede: c:dynamicproxy clean up after cronjob removal. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[06:59:34] <wikibugs>	 (CR) Slyngshede: [C: +2] c:dynamicproxy clean up after cronjob removal. [puppet] - https://gerrit.wikimedia.org/r/826287 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[07:00:04] <jouncebot>	 Amir1, apergos, jnuche, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC morning backport and config training deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0700).
[07:00:14] <apergos>	 good morning! there are no trainees signed up today and no patches scheduled in the window.
[07:00:16] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:48] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:01:19] <jynus>	 that looks not great
[07:01:51] <jynus>	 could it be related to the problem Amir raised?
[07:03:00] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Swift
[07:03:42] <RhinosF1>	 jynus: just seen a report on irc of someone getting "04:00:48 <dmacks_away> On commons, "Error deleting file: An unknown error occurred in storage backend "local-swift-eqiad". ""
[07:03:53] <RhinosF1>	 That's 4 hours ago
[07:06:45] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline." [puppet] - https://gerrit.wikimedia.org/r/826427 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse)
[07:08:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33031 and previous config saved to /var/cache/conftool/dbconfig/20220825-070821-ladsgroup.json
[07:08:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33032 and previous config saved to /var/cache/conftool/dbconfig/20220825-070835-root.json
[07:11:58] <icinga-wm>	 PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:32] <wikibugs>	 (CR) Muehlenhoff: netmon: Use systemd::sysuser and reserve id for the LibreNMS user. (15 comments) [puppet] - https://gerrit.wikimedia.org/r/826431 (https://phabricator.wikimedia.org/T315388) (owner: Andrea Denisse)
[07:12:53] <jynus>	 there seem to be occasional spikes of 504s from eqiad
[07:13:01] <jynus>	 *from codfw, not eqiad
[07:13:20] <jynus>	 could be some higher network latency or a proxy overload
[07:16:26] <icinga-wm>	 RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:17:22] <wikibugs>	 (CR) Jcrespo: [C: +1] "This is ready- x1 snapshots on codfw failed twice, but can be retried after maintenance." [puppet] - https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: Marostegui)
[07:18:01] <wikibugs>	 SRE, SRE-Access-Requests: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (Ladsgroup)
[07:23:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P33033 and previous config saved to /var/cache/conftool/dbconfig/20220825-072327-ladsgroup.json
[07:23:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling after cloning db1185', diff saved to https://phabricator.wikimedia.org/P33034 and previous config saved to /var/cache/conftool/dbconfig/20220825-072340-root.json
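The staged repool of db1130 visible in the logmsgbot lines above (5%, 10%, 25%, 50%, 75%, 100%) can be pulled out of a log like this one with a short parser. This is a minimal sketch, assuming the line layout seen in this channel; the regular expression, the function name, and the abbreviated sample lines are illustrative and are not part of dbctl or any Wikimedia tooling.

```python
import re

# Matches the timestamp and the quoted "<host> (re)pooling @ N%" commit message.
REPOOL_RE = re.compile(
    r"^\[(?P<time>\d{2}:\d{2}:\d{2})\].*'(?P<host>db\d+) \(re\)pooling @ (?P<pct>\d+)%"
)


def repool_steps(lines, host):
    """Return (time, percent) pairs for one host, in log order."""
    steps = []
    for line in lines:
        match = REPOOL_RE.search(line)
        if match and match.group("host") == host:
            steps.append((match.group("time"), int(match.group("pct"))))
    return steps


# Abbreviated copies of lines from the log (everything after the message is cut).
SAMPLE = [
    "[06:53:31] <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 50%: Repooling after cloning db1185'",
    "[07:08:36] <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 75%: Repooling after cloning db1185'",
    "[07:23:41] <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1130 (re)pooling @ 100%: Repooling after cloning db1185'",
    "[07:43:07] <logmsgbot> !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 2%: Repooling for the first time'",
]

if __name__ == "__main__":
    for time, pct in repool_steps(SAMPLE, "db1130"):
        print(f"{time} db1130 at {pct}%")
```

Filtering by host keeps interleaved repools (db1186, db1188, db1190, db1191 later in the log) from mixing into one timeline.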
[07:29:18] <wikibugs>	 (CR) Marostegui: [C: +2] "This was meant to be: Enable notifications" [puppet] - https://gerrit.wikimedia.org/r/826420 (owner: Marostegui)
[07:30:10] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334 (owner: Marostegui)
[07:30:53] <wikibugs>	 (Merged) jenkins-bot: Revert "ProductionServices.php: Promote pc1014 to pc2 master" [mediawiki-config] - https://gerrit.wikimedia.org/r/826334 (owner: Marostegui)
[07:31:41] <wikibugs>	 (PS2) Marostegui: Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - https://gerrit.wikimedia.org/r/826335
[07:32:55] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "parsercache: Promote pc1014 to pc2 master" [puppet] - https://gerrit.wikimedia.org/r/826335 (owner: Marostegui)
[07:34:40] <marostegui>	 !log Promote pc1012 back as pc2 master T315526
[07:34:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:34:44] <stashbot>	 T315526: Promote pc1014 to pc2 master - https://phabricator.wikimedia.org/T315526
[07:36:09] <logmsgbot>	 !log marostegui@deploy1002 Synchronized wmf-config/ProductionServices.php: Promote pc1012 to pc2 master T315526 (duration: 03m 39s)
[07:38:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[07:38:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T314041)', diff saved to https://phabricator.wikimedia.org/P33035 and previous config saved to /var/cache/conftool/dbconfig/20220825-073834-ladsgroup.json
[07:38:36] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[07:38:38] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[07:38:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
[07:38:55] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T314041)', diff saved to https://phabricator.wikimedia.org/P33036 and previous config saved to /var/cache/conftool/dbconfig/20220825-073855-ladsgroup.json
[07:39:18] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[07:39:19] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[07:40:20] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[07:40:41] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T314041)', diff saved to https://phabricator.wikimedia.org/P33037 and previous config saved to /var/cache/conftool/dbconfig/20220825-074041-ladsgroup.json
[07:40:51] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1185, db1186 and db1188 [puppet] - https://gerrit.wikimedia.org/r/826494
[07:42:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1137.eqiad.wmnet with reason: Maintenance
[07:42:07] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1185, db1186 and db1188 [puppet] - https://gerrit.wikimedia.org/r/826494 (owner: Marostegui)
[07:42:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1137.eqiad.wmnet with reason: Maintenance
[07:42:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33038 and previous config saved to /var/cache/conftool/dbconfig/20220825-074220-ladsgroup.json
[07:42:25] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[07:43:07] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33039 and previous config saved to /var/cache/conftool/dbconfig/20220825-074307-root.json
[07:43:16] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33040 and previous config saved to /var/cache/conftool/dbconfig/20220825-074315-root.json
[07:43:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33041 and previous config saved to /var/cache/conftool/dbconfig/20220825-074323-root.json
[07:44:01] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 2%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33042 and previous config saved to /var/cache/conftool/dbconfig/20220825-074400-root.json
[07:44:38] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1192 [puppet] - https://gerrit.wikimedia.org/r/826495 (https://phabricator.wikimedia.org/T313569)
[07:45:39] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1192 [puppet] - https://gerrit.wikimedia.org/r/826495 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[07:51:19] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864
[07:51:24] <stashbot>	 T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864
[07:51:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db[2132,2160].codfw.wmnet,db[1117,1164,1195].eqiad.wmnet with reason: Switchover m1 T315864
[07:52:52] <wikibugs>	 (PS1) Slyngshede: c:spamassassin debug alerting on Spamassassin timer. [puppet] - https://gerrit.wikimedia.org/r/826496
[07:54:29] <wikibugs>	 (PS3) Marostegui: mariadb: Promote db1195 to m1 master [puppet] - https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864)
[07:55:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33044 and previous config saved to /var/cache/conftool/dbconfig/20220825-075547-ladsgroup.json
[07:56:22] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1195 to m1 master [puppet] - https://gerrit.wikimedia.org/r/826222 (https://phabricator.wikimedia.org/T315864) (owner: Marostegui)
[07:58:12] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33045 and previous config saved to /var/cache/conftool/dbconfig/20220825-075811-root.json
[07:58:21] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33046 and previous config saved to /var/cache/conftool/dbconfig/20220825-075820-root.json
[07:58:28] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33047 and previous config saved to /var/cache/conftool/dbconfig/20220825-075828-root.json
[07:59:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 3%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33048 and previous config saved to /var/cache/conftool/dbconfig/20220825-075905-root.json
[07:59:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P33049 and previous config saved to /var/cache/conftool/dbconfig/20220825-075924-root.json
[08:00:04] <jouncebot>	 hashar and dduvall: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-0+Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T0800).
[08:01:17] <wikibugs>	 (PS2) Slyngshede: c:spamassassin debug alerting on Spamassassin timer. [puppet] - https://gerrit.wikimedia.org/r/826496
[08:03:13] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1193 [puppet] - https://gerrit.wikimedia.org/r/826498 (https://phabricator.wikimedia.org/T313569)
[08:03:57] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.190 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:04:53] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Swift
[08:05:05] <wikibugs>	 (PS3) Slyngshede: c:spamassassin debug alerting on Spamassassin timer. [puppet] - https://gerrit.wikimedia.org/r/826496
[08:06:38] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1193 [puppet] - https://gerrit.wikimedia.org/r/826498 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[08:06:46] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36970/console" [puppet] - https://gerrit.wikimedia.org/r/826496 (owner: Slyngshede)
[08:07:01] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:26] <wikibugs>	 (CR) Slyngshede: [V: +1 C: +2] c:spamassassin debug alerting on Spamassassin timer. [puppet] - https://gerrit.wikimedia.org/r/826496 (owner: Slyngshede)
[08:09:58] <marostegui>	 !log Reboot db1195 for kernel upgrade T315864
[08:10:02] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:03] <stashbot>	 T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864
[08:10:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P33050 and previous config saved to /var/cache/conftool/dbconfig/20220825-081053-ladsgroup.json
[08:10:56] <jynus>	 I am going to stop bacula for some time, please avoid accidentally deleting production data in the next hour or so
[08:12:53] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1014 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:12:59] <marostegui>	 ^ me
[08:13:01] <jynus>	 !log stopping bacula services on backup1001 T315864
[08:13:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:07] <claime>	 jynus: Oh man there goes my morning task of dropping the prod databases :(
[08:13:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33051 and previous config saved to /var/cache/conftool/dbconfig/20220825-081316-root.json
[08:13:18] <claime>	 (sorry)
[08:13:25] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33052 and previous config saved to /var/cache/conftool/dbconfig/20220825-081325-root.json
[08:13:27] <jynus>	 claime: please wait until maintenance is complete, apologies for disturbance
[08:13:33] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33053 and previous config saved to /var/cache/conftool/dbconfig/20220825-081333-root.json
[08:13:36] <claime>	 x)
[08:13:44] <jynus>	 it should be done in less than 1h
[08:14:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 4%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33054 and previous config saved to /var/cache/conftool/dbconfig/20220825-081410-root.json
[08:14:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P33055 and previous config saved to /var/cache/conftool/dbconfig/20220825-081429-root.json
[08:14:37] <icinga-wm>	 PROBLEM - MediaWiki EtcdConfig up-to-date on mw2396 is CRITICAL: etcd last index (1119153) is outdated compared to the master one (1119159) https://wikitech.wikimedia.org/wiki/Etcd
[08:15:15] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:15:53] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1014 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[08:16:05] <icinga-wm>	 RECOVERY - MediaWiki EtcdConfig up-to-date on mw2396 is OK: etcd last index (1119159) matches the master one (1119159) https://wikitech.wikimedia.org/wiki/Etcd
[08:17:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:17:56] <jynus>	 ^that is me
[08:18:01] <jynus>	 bacula is down at the moment
[08:19:38] <wikibugs>	 (CR) Vgutierrez: Varnish: Stop sending analytics cookies to API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: BCornwall)
[08:22:14] <wikibugs>	 (CR) Vgutierrez: [C: +2] Increase roll-out of query-sorting to 5% [puppet] - https://gerrit.wikimedia.org/r/826398 (https://phabricator.wikimedia.org/T314868) (owner: Ori)
[08:22:51] <vgutierrez>	 Increase roll-out of query-sorting to 5%
[08:23:10] <vgutierrez>	 !log Increase roll-out of query-sorting to 5% - T314868
[08:23:13] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:14] <stashbot>	 T314868: Roll out query parameter normalization - https://phabricator.wikimedia.org/T314868
[08:26:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T314041)', diff saved to https://phabricator.wikimedia.org/P33056 and previous config saved to /var/cache/conftool/dbconfig/20220825-082559-ladsgroup.json
[08:26:01] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:26:04] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[08:26:15] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
[08:26:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33057 and previous config saved to /var/cache/conftool/dbconfig/20220825-082621-ladsgroup.json
[08:28:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33058 and previous config saved to /var/cache/conftool/dbconfig/20220825-082807-ladsgroup.json
[08:28:11] <wikibugs>	 (CR) JMeybohm: [C: -1] Stop reporting releng images to debmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff)
[08:28:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33059 and previous config saved to /var/cache/conftool/dbconfig/20220825-082821-root.json
[08:28:30] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33060 and previous config saved to /var/cache/conftool/dbconfig/20220825-082830-root.json
[08:28:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33061 and previous config saved to /var/cache/conftool/dbconfig/20220825-082837-root.json
[08:29:15] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33062 and previous config saved to /var/cache/conftool/dbconfig/20220825-082915-root.json
[08:29:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P33063 and previous config saved to /var/cache/conftool/dbconfig/20220825-082933-root.json
[08:30:01] <marostegui>	 !log Failover m1 from db1164 to db1195 - T315864
[08:30:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:30:05] <stashbot>	 T315864: Switchover m1 master (db1164 -> db1195) - https://phabricator.wikimedia.org/T315864
[08:30:42] <marostegui>	 done
[08:33:49] <wikibugs>	 (CR) Marostegui: [C: +2] dbbackups: Replace m1 master [puppet] - https://gerrit.wikimedia.org/r/826223 (https://phabricator.wikimedia.org/T315864) (owner: Marostegui)
[08:39:40] <jynus>	 !log restarting backupmon1001
[08:39:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:15] <icinga-wm>	 PROBLEM - Check systemd state on pki2001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-Wikimedia_Internal_Root_CA.service,cfssl-ocsprefresh-cloud_wmnet_ca.service,cfssl-ocsprefresh-debmonitor.service,cfssl-ocsprefresh-discovery.service,cfssl-ocsprefresh-etcd.service,cfssl-ocsprefresh-kafka.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:29] <icinga-wm>	 PROBLEM - Check systemd state on pki1001 is CRITICAL: CRITICAL - degraded: The following units failed: cfssl-ocsprefresh-debmonitor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:40:31] <jynus>	 you gotta love how fast VMs reboot compared to their physical counterparts :-D
[08:42:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job bacula in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:43:13] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33064 and previous config saved to /var/cache/conftool/dbconfig/20220825-084313-ladsgroup.json
[08:43:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33065 and previous config saved to /var/cache/conftool/dbconfig/20220825-084326-root.json
[08:43:35] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33066 and previous config saved to /var/cache/conftool/dbconfig/20220825-084334-root.json
[08:43:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33067 and previous config saved to /var/cache/conftool/dbconfig/20220825-084342-root.json
[08:44:20] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33068 and previous config saved to /var/cache/conftool/dbconfig/20220825-084419-root.json
[08:44:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P33069 and previous config saved to /var/cache/conftool/dbconfig/20220825-084438-root.json
[08:50:15] <moritzm>	 !log installing gnutls28 security updates on bullseye
[08:50:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:53:35] <jynus>	 did we get the recovery for the bacula prometheus job?
[08:54:00] <hashar>	 good morning, I have overslept
[08:54:37] <moritzm>	 !log installing curl security updates on bullseye
[08:54:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:51] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[08:56:06] <wikibugs>	 (PS1) Marostegui: dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - https://gerrit.wikimedia.org/r/826506
[08:56:29] <wikibugs>	 (PS1) Hashar: Revert "group1 wikis to 1.39.0-wmf.26" [mediawiki-config] - https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085)
[08:56:29] <jynus>	 Oh, I missed it above "(JobUnavailable) resolved:"
[08:56:46] <wikibugs>	 (CR) CI reject: [V: -1] dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - https://gerrit.wikimedia.org/r/826506 (owner: Marostegui)
[08:57:12] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (Ladsgroup)
[08:57:13] <wikibugs>	 (CR) Hashar: [C: +2] "Got applied yesterday manually but I forgot to push it to Gerrit." [mediawiki-config] - https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085) (owner: Hashar)
[08:57:29] <wikibugs>	 (PS1) TrainBranchBot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187)
[08:57:33] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187) (owner: TrainBranchBot)
[08:57:55] <James_F>	 hashar: Ha, whoops.
[08:57:56] <wikibugs>	 (Merged) jenkins-bot: Revert "group1 wikis to 1.39.0-wmf.26" [mediawiki-config] - https://gerrit.wikimedia.org/r/826507 (https://phabricator.wikimedia.org/T316085) (owner: Hashar)
[08:58:05] <James_F>	 Also tsk. ;-)
[08:58:18] <wikibugs>	 (Merged) jenkins-bot: group1 wikis to 1.39.0-wmf.26 [mediawiki-config] - https://gerrit.wikimedia.org/r/826508 (https://phabricator.wikimedia.org/T314187) (owner: TrainBranchBot)
[08:58:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P33070 and previous config saved to /var/cache/conftool/dbconfig/20220825-085819-ladsgroup.json
[08:58:27] <wikibugs>	 (PS2) Marostegui: dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - https://gerrit.wikimedia.org/r/826506
[08:58:31] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33071 and previous config saved to /var/cache/conftool/dbconfig/20220825-085831-root.json
[08:58:40] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33072 and previous config saved to /var/cache/conftool/dbconfig/20220825-085839-root.json
[08:58:46] <wikibugs>	 (CR) Jcrespo: [C: +1] "IPs and service on the port double-checked." [puppet] - https://gerrit.wikimedia.org/r/826506 (owner: Marostegui)
[08:58:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33073 and previous config saved to /var/cache/conftool/dbconfig/20220825-085847-root.json
[08:59:10] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1012,dbproxy1014: Replace db1164 with db1117 [puppet] - https://gerrit.wikimedia.org/r/826506 (owner: Marostegui)
[08:59:24] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 25%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33074 and previous config saved to /var/cache/conftool/dbconfig/20220825-085924-root.json
[08:59:43] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P33075 and previous config saved to /var/cache/conftool/dbconfig/20220825-085943-root.json
[09:01:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:02:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:02:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:02:26] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.39.0-wmf.26  refs T314187
[09:02:30] <stashbot>	 T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187
[09:03:04] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:05:57] <logmsgbot>	 !log hashar@deploy1002 Synchronized php: group1 wikis to 1.39.0-wmf.26  refs T314187 (duration: 03m 30s)
[09:07:50] <wikibugs>	 (PS1) Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878)
[09:08:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[09:09:01] <wikibugs>	 (PS1) Marostegui: mariadb: Move db1164 to m2 [puppet] - https://gerrit.wikimedia.org/r/826510 (https://phabricator.wikimedia.org/T316187)
[09:09:05] <wikibugs>	 (CR) CI reject: [V: -1] admin: Add siko to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) (owner: Ladsgroup)
[09:09:14] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[09:09:15] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[09:09:43] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Move db1164 to m2 [puppet] - https://gerrit.wikimedia.org/r/826510 (https://phabricator.wikimedia.org/T316187) (owner: Marostegui)
[09:10:09] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[09:11:36] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1013 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:12:56] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:13:06] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:13:12] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:13:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T314041)', diff saved to https://phabricator.wikimedia.org/P33077 and previous config saved to /var/cache/conftool/dbconfig/20220825-091325-ladsgroup.json
[09:13:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[09:13:30] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[09:13:36] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33078 and previous config saved to /var/cache/conftool/dbconfig/20220825-091336-root.json
[09:13:41] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2100.codfw.wmnet with reason: Maintenance
[09:13:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33079 and previous config saved to /var/cache/conftool/dbconfig/20220825-091344-root.json
[09:13:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33080 and previous config saved to /var/cache/conftool/dbconfig/20220825-091351-root.json
[09:14:22] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1015 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:14:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[09:14:29] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33081 and previous config saved to /var/cache/conftool/dbconfig/20220825-091428-root.json
[09:14:42] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
[09:14:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T314041)', diff saved to https://phabricator.wikimedia.org/P33082 and previous config saved to /var/cache/conftool/dbconfig/20220825-091447-ladsgroup.json
[09:14:48] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P33083 and previous config saved to /var/cache/conftool/dbconfig/20220825-091448-root.json
[09:15:32] <wikibugs>	 (CR) Btullis: [C: +1] "Looks good to me. I can +2 and merge if that helps." [puppet] - https://gerrit.wikimedia.org/r/817907 (owner: Bearloga)
[09:16:14] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:16:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:16:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:16:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T314041)', diff saved to https://phabricator.wikimedia.org/P33084 and previous config saved to /var/cache/conftool/dbconfig/20220825-091633-ladsgroup.json
[09:18:45] <wikibugs>	 (CR) Btullis: [C: +1] "This seems fine to me. I'm happy to +2 and merge if it helps." [puppet] - https://gerrit.wikimedia.org/r/817903 (owner: Bearloga)
[09:19:03] <wikibugs>	 (PS2) Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878)
[09:19:47] <hashar>	 oh nice, uploads to Commons look broken
[09:20:07] <hashar>	 `/w/api.php`   PHP Warning: fopen(): Filename cannot be empty
[09:21:09] <hashar>	 and Fancy captcha has some Swift-related `Iterator page I/O error.`
[09:21:58] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:22:05] <wikibugs>	 (CR) Vgutierrez: [C: +1] "Tests are currently happy. Even if we don't alter GeoIP behaviour in this CR I think it's ok to have it on the VTC code to ensure that api" [puppet] - https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: BCornwall)
[09:23:00] <hashar>	 might have been transient
[09:23:03] <wikibugs>	 (PS1) Slyngshede: c:spamassassin remove cronjob, and use systemd timer [puppet] - https://gerrit.wikimedia.org/r/826513
[09:23:13] <wikibugs>	 (CR) Btullis: [V: +1 C: +1] "PCC SUCCESS (DIFF 10): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36973/console" [puppet] - https://gerrit.wikimedia.org/r/817907 (owner: Bearloga)
[09:23:28] <wikibugs>	 (PS3) Ladsgroup: admin: Add siko to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878)
[09:23:31] <wikibugs>	 (PS1) Marostegui: site.pp: Remove db1193 from insetup [puppet] - https://gerrit.wikimedia.org/r/826514 (https://phabricator.wikimedia.org/T313569)
[09:23:33] <wikibugs>	 (CR) Ladsgroup: [V: +2 C: +2] admin: Add siko to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/826509 (https://phabricator.wikimedia.org/T315878) (owner: Ladsgroup)
[09:23:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33085 and previous config saved to /var/cache/conftool/dbconfig/20220825-092356-ladsgroup.json
[09:24:01] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[09:24:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:24:40] <wikibugs>	 (CR) Marostegui: [C: +2] site.pp: Remove db1193 from insetup [puppet] - https://gerrit.wikimedia.org/r/826514 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[09:24:56] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for @SiKo_WMDE - https://phabricator.wikimedia.org/T315878 (Ladsgroup) Open→Resolved You should be able to have access in half an hour.
[09:25:19] <hashar>	 I will do the rest of the wikis after our timezone lunch or in 2-3 hours from now
[09:27:01] <wikibugs>	 (CR) Slyngshede: [C: +2] c:dynamicproxy move cronjob to systemd timer. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819505 (https://phabricator.wikimedia.org/T273673) (owner: Slyngshede)
[09:28:19] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36974/console" [puppet] - https://gerrit.wikimedia.org/r/826513 (owner: Slyngshede)
[09:28:41] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33086 and previous config saved to /var/cache/conftool/dbconfig/20220825-092840-root.json
[09:28:46] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1192 and db1193 [puppet] - https://gerrit.wikimedia.org/r/826515 (https://phabricator.wikimedia.org/T313569)
[09:28:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33087 and previous config saved to /var/cache/conftool/dbconfig/20220825-092848-root.json
[09:28:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33088 and previous config saved to /var/cache/conftool/dbconfig/20220825-092856-root.json
[09:29:30] <wikibugs>	 (PS1) Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140)
[09:29:34] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33089 and previous config saved to /var/cache/conftool/dbconfig/20220825-092933-root.json
[09:30:16] <wikibugs>	 (CR) Muehlenhoff: Stop reporting releng images to debmonitor (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826211 (owner: Muehlenhoff)
[09:30:32] <wikibugs>	 (CR) CI reject: [V: -1] Admin: Add Amanda Bittaker to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) (owner: Ladsgroup)
[09:31:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33090 and previous config saved to /var/cache/conftool/dbconfig/20220825-093140-ladsgroup.json
[09:32:07] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1192 and db1193 [puppet] - https://gerrit.wikimedia.org/r/826515 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[09:32:22] <wikibugs>	 SRE, Traffic, Patch-For-Review: Don't set cookies for api.wikimedia.org at the caching layer - https://phabricator.wikimedia.org/T260943 (Vgutierrez) https://gerrit.wikimedia.org/r/824793 submitted by @BCornwall removes `WMF-Last-Access` cookie from api.wikimedia.org, as he mentioned this also remove...
[09:33:47] <wikibugs>	 (PS2) Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140)
[09:35:28] <jynus>	 !log restart backup2001
[09:35:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:46] <icinga-wm>	 RECOVERY - Check systemd state on pki1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:36:08] <icinga-wm>	 RECOVERY - Check systemd state on pki2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:39:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P33091 and previous config saved to /var/cache/conftool/dbconfig/20220825-093902-ladsgroup.json
[09:39:09] <marostegui>	 !log Reboot standby dbproxy hosts
[09:39:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:34] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1021 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:43:29] <icinga-wm>	 ACKNOWLEDGEMENT - HP RAID on ms-be2035 is CRITICAL: CRITICAL: Slot 3: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, 2I:4:1, 2I:4:2 - Controller: OK - Cache: Permanently Disabled - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T316194 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gat
[09:43:37] <wikibugs>	 SRE, ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (ops-monitoring-bot)
[09:43:45] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1186 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33092 and previous config saved to /var/cache/conftool/dbconfig/20220825-094345-root.json
[09:43:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1188 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33093 and previous config saved to /var/cache/conftool/dbconfig/20220825-094353-root.json
[09:44:02] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1190 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33094 and previous config saved to /var/cache/conftool/dbconfig/20220825-094401-root.json
[09:44:38] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: Repooling for the first time', diff saved to https://phabricator.wikimedia.org/P33095 and previous config saved to /var/cache/conftool/dbconfig/20220825-094438-root.json
[09:46:46] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33096 and previous config saved to /var/cache/conftool/dbconfig/20220825-094646-ladsgroup.json
[09:46:50] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[09:48:39] <wikibugs>	 SRE, Security, cloud-services-team (Kanban): Reboot WMCS proxies - https://phabricator.wikimedia.org/T316195 (Marostegui)
[09:49:02] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1021 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:49:40] <jynus>	 !log restart backup1002, backup2002
[09:49:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:50:22] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:50:26] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw
[09:50:27] <wikibugs>	 (CR) Hnowlan: [C: +2] Revert setting expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[09:50:53] <wikibugs>	 (PS1) Ladsgroup: auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522
[09:51:05] <moritzm>	 !log installing libxslt security updates on bullseye
[09:51:07] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:39] <wikibugs>	 (CR) CI reject: [V: -1] auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522 (owner: Ladsgroup)
[09:52:09] <wikibugs>	 (Merged) jenkins-bot: Revert setting expiry headers on thumbnails [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/825788 (https://phabricator.wikimedia.org/T252719) (owner: Vlad.shapik)
[09:54:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137', diff saved to https://phabricator.wikimedia.org/P33097 and previous config saved to /var/cache/conftool/dbconfig/20220825-095408-ladsgroup.json
[09:56:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P33098 and previous config saved to /var/cache/conftool/dbconfig/20220825-095611-ladsgroup.json
[09:56:52] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[09:59:28] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[09:59:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
[09:59:32] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:59:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[09:59:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33099 and previous config saved to /var/cache/conftool/dbconfig/20220825-095942-ladsgroup.json
[09:59:48] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:00:00] <wikibugs>	 (PS5) Hnowlan: install_server: add reimage role for sessionstore [puppet] - https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833)
[10:00:05] <jouncebot>	 mvolz: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1000).
[10:00:10] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1174', diff saved to https://phabricator.wikimedia.org/P33100 and previous config saved to /var/cache/conftool/dbconfig/20220825-100010-root.json
[10:00:41] <wikibugs>	 (PS2) Ladsgroup: auto_schema: More work on multidc support [software] - https://gerrit.wikimedia.org/r/826522
[10:02:36] <wikibugs>	 (PS1) Marostegui: mariadb: Productionize db1194 [puppet] - https://gerrit.wikimedia.org/r/826524 (https://phabricator.wikimedia.org/T313569)
[10:02:55] <wikibugs>	 (CR) Hnowlan: [C: +2] install_server: add reimage role for sessionstore [puppet] - https://gerrit.wikimedia.org/r/770984 (https://phabricator.wikimedia.org/T303833) (owner: Hnowlan)
[10:03:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host build2001.codfw.wmnet
[10:03:46] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:04:26] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Productionize db1194 [puppet] - https://gerrit.wikimedia.org/r/826524 (https://phabricator.wikimedia.org/T313569) (owner: Marostegui)
[10:04:28] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:05:54] <wikibugs>	 (03PS1) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174)
[10:06:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[10:08:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host build2001.codfw.wmnet
[10:09:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1137 (T312160)', diff saved to https://phabricator.wikimedia.org/P33102 and previous config saved to /var/cache/conftool/dbconfig/20220825-100915-ladsgroup.json
[10:09:17] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:09:21] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[10:09:30] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1102.eqiad.wmnet with reason: Maintenance
[10:09:36] <icinga-wm>	 PROBLEM - Check systemd state on webperf2004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_generate_svgs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:09:42] <wikibugs>	 (03PS2) 10Muehlenhoff: Stop reporting releng images to debmonitor [puppet] - 10https://gerrit.wikimedia.org/r/826211
[10:10:16] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1017 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:13:16] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:13:32] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw
[10:15:10] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.283 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:15:35] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] "LGTM, will amend 826245 once merged with the removal of `profile::docker::engine::force_default_docker_storage` if you don't want to do re" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis)
[10:16:31] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: server-glitch hampering deletions: backend-fail-internal - https://phabricator.wikimedia.org/T316188 (10jcrespo) This is in ongoing investigation.
[10:16:44] <icinga-wm>	 RECOVERY - Check systemd state on webperf2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:17:22] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: server-glitch hampering deletions: backend-fail-internal - https://phabricator.wikimedia.org/T316188 (10jcrespo) p:05Triage→03Unbreak!
[10:17:24] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1017 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:17:26] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:19:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33103 and previous config saved to /var/cache/conftool/dbconfig/20220825-101930-ladsgroup.json
[10:19:36] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[10:22:37] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo)
[10:22:46] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1016 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[10:23:06] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad
[10:24:48] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Remove insetup from db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826526 (https://phabricator.wikimedia.org/T313569)
[10:25:30] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.241 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:25:57] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] site.pp: Remove insetup from db1194 [puppet] - 10https://gerrit.wikimedia.org/r/826526 (https://phabricator.wikimedia.org/T313569) (owner: 10Marostegui)
[10:27:30] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1016 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:28:17] <wikibugs>	 (03PS1) 10Marostegui: site.pp: Fix db1194 location [puppet] - 10https://gerrit.wikimedia.org/r/826527
[10:28:37] <wikibugs>	 (03CR) 10Marostegui: [V: 03+2 C: 03+2] site.pp: Fix db1194 location [puppet] - 10https://gerrit.wikimedia.org/r/826527 (owner: 10Marostegui)
[10:30:08] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[10:32:40] <wikibugs>	 (03PS1) 10Marostegui: wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186)
[10:33:12] <wikibugs>	 (03CR) 10David Caro: Modify maintain-dbusers.py to call the rest-api service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/809921 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe)
[10:33:40] <wikibugs>	 (03PS2) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174)
[10:33:57] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] "I tested this (by copying the script to my home directory and manually editing a deployment) and it works fine." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm)
[10:34:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P33104 and previous config saved to /var/cache/conftool/dbconfig/20220825-103436-ladsgroup.json
[10:34:45] <wikibugs>	 (03Merged) 10jenkins-bot: python39: Use shell reimplementation of webservice-runner [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm)
[10:37:03] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "I've re-enabled the spamassassin update timer on otrs1001 and I'm unable to reproduce the error." [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede)
[10:40:40] <wikibugs>	 (03PS5) 10Btullis: Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177)
[10:40:56] <wikibugs>	 (03CR) 10Btullis: Enable the dse-k8s-worker nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis)
[10:42:44] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1013 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:42:51] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad
[10:44:32] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1015 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[10:49:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P33105 and previous config saved to /var/cache/conftool/dbconfig/20220825-104942-ladsgroup.json
[10:50:05] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui) I am planning to do this switchover on Monday 29th at 08:30 AM UTC. The expected impact would be around 15-30 seconds of RO time. Reads won...
[10:50:57] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui)
[10:58:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532
[10:59:22] <wikibugs>	 (03PS1) 10Vgutierrez: swift: Set sd[dz]1@ms-be1071 as failed [puppet] - 10https://gerrit.wikimedia.org/r/826533 (https://phabricator.wikimedia.org/T315437)
[10:59:27] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff)
[11:00:11] <wikibugs>	 (03PS2) 10Hnowlan: api-gateway: custom host overrides in discovery services. [deployment-charts] - 10https://gerrit.wikimedia.org/r/825729
[11:01:02] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36975/console" [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff)
[11:01:06] <wikibugs>	 (03Abandoned) 10Vgutierrez: swift: Set sd[dz]1@ms-be1071 as failed [puppet] - 10https://gerrit.wikimedia.org/r/826533 (https://phabricator.wikimedia.org/T315437) (owner: 10Vgutierrez)
[11:04:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T314041)', diff saved to https://phabricator.wikimedia.org/P33106 and previous config saved to /var/cache/conftool/dbconfig/20220825-110448-ladsgroup.json
[11:04:58] <stashbot>	 T314041: Drop old templatelinks columns and indexes - https://phabricator.wikimedia.org/T314041
[11:07:25] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[11:07:38] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1109.eqiad.wmnet with reason: Maintenance
[11:08:24] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[11:08:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2161.codfw.wmnet with reason: Maintenance
[11:08:58] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+1] Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff)
[11:10:28] <wikibugs>	 10SRE, 10Search-Console-access-request: [REQUEST] Access to GSC for Wikipedia for TomekSikora.Monsoon - https://phabricator.wikimedia.org/T316212 (10soworu)
[11:11:38] <wikibugs>	 (03PS1) 10Majavah: Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536
[11:13:31] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186) (owner: 10Marostegui)
[11:14:06] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m1-master [dns] - 10https://gerrit.wikimedia.org/r/826528 (https://phabricator.wikimedia.org/T316186) (owner: 10Marostegui)
[11:14:20] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah)
[11:16:31] <wikibugs>	 (03PS8) 10Hnowlan: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[11:17:13] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis)
[11:19:55] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1 C: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36977/console" [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis)
[11:24:07] <wikibugs>	 (03PS2) 10Majavah: Remove support for overriding LDAP client stack [puppet] - 10https://gerrit.wikimedia.org/r/826536
[11:26:44] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.163 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:27:45] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36980/console" [puppet] - 10https://gerrit.wikimedia.org/r/826536 (owner: 10Majavah)
[11:28:30] <wikibugs>	 (03PS9) 10Hnowlan: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[11:29:31] <marostegui>	 !log Failover m1-master
[11:29:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:30:13] <wikibugs>	 10SRE, 10SRE-tools, 10DBA, 10Infrastructure-Foundations, and 2 others: Switchover m2 master db1159 -> db1164 - https://phabricator.wikimedia.org/T316202 (10Marostegui)
[11:32:34] <godog>	 !log restart swift-proxy on ms-fe1010
[11:32:36] <godog>	 jynus: ^
[11:32:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:32:41] <jynus>	 thanks
[11:33:16] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[11:40:20] <godog>	 !log depool ms-fe1012, leave swift-proxy alone for investigation
[11:40:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:49:20] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340
[11:49:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340 (owner: 10Filippo Giunchedi)
[11:50:11] <godog>	 WAT
[11:50:19] <wikibugs>	 (03PS1) 10KartikMistry: CX3 Build 0.2.0+20220825 [extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986)
[11:50:54] <wikibugs>	 (03PS2) 10Filippo Giunchedi: Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340
[11:51:18] <godog>	 ok commit message reformatted, gods of CI appeased
[11:51:20] <wikibugs>	 (03Abandoned) 10Muehlenhoff: Mark two disks as failed on ms-be1071 [puppet] - 10https://gerrit.wikimedia.org/r/826532 (owner: 10Muehlenhoff)
[11:52:23] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "swift: bump proxy memcache max connections" [puppet] - 10https://gerrit.wikimedia.org/r/826340 (owner: 10Filippo Giunchedi)
[11:52:46] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) We have focused on updating primarily the status page (https://www.wikimediastatus.net), but we believ...
[11:53:36] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10taavi)
[11:56:32] <icinga-wm>	 ACKNOWLEDGEMENT - Disk space on ms-be1071 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdz1 is not accessible: Input/output error Muehlenhoff T315437 https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=ms-be1071&var-datasource=eqiad+prometheus/ops
[11:56:34] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[11:56:37] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1116.eqiad.wmnet with reason: Maintenance
[11:57:52] <godog>	 !log roll-restart swift-proxy on thanos-fe* and ms-fe* (not ms-fe1012)
[11:57:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:58:11] <godog>	 jynus: FYI ^
[11:58:25] <jynus>	 thanks for the ping, I had missed that
[12:02:29] <wikibugs>	 (03PS2) 10Hnowlan: restbase: add restbase103[123] [puppet] - 10https://gerrit.wikimedia.org/r/803520
[12:03:46] <icinga-wm>	 PROBLEM - Swift https frontend on ms-fe1012 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.183 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:05:48] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:05:58] <icinga-wm>	 RECOVERY - Swift https frontend on ms-fe1012 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.010 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:06:40] <logmsgbot>	 !log filippo@cumin1001 START - Cookbook sre.hosts.downtime for 11 days, 0:00:00 on ms-fe1012.eqiad.wmnet with reason: known depooled, left for investigation
[12:06:53] <logmsgbot>	 !log filippo@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 11 days, 0:00:00 on ms-fe1012.eqiad.wmnet with reason: known depooled, left for investigation
[12:15:12] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:16:27] <wikibugs>	 (03CR) 10Ayounsi: [C: 04-1] "Nice! One change needed then lgtm." [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[12:16:29] <wikibugs>	 (03Abandoned) 10Kosta Harlan: GrowthExperiments: Enable AddLink for next round of wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/723517 (https://phabricator.wikimedia.org/T290011) (owner: 10Kosta Harlan)
[12:17:43] <wikibugs>	 (03PS2) 10Kosta Harlan: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno)
[12:17:51] <wikibugs>	 (03PS3) 10Kosta Harlan: GrowthExperiments: end mailing list campaign in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/811756 (https://phabricator.wikimedia.org/T307985) (owner: 10Sergio Gimeno)
[12:19:54] <wikibugs>	 (03PS3) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174)
[12:19:56] <wikibugs>	 (03PS3) 10Kosta Harlan: Declare mediawiki.createaccount_blocked_user schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822686 (https://phabricator.wikimedia.org/T306018) (owner: 10Sergio Gimeno)
[12:20:59] <wikibugs>	 (03CR) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[12:24:26] <wikibugs>	 (03PS7) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[12:24:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[12:31:06] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Enable the dse-k8s-worker nodes [puppet] - 10https://gerrit.wikimedia.org/r/826240 (https://phabricator.wikimedia.org/T310177) (owner: 10Btullis)
[12:31:13] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Testing a script
[12:31:27] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Testing a script
[12:34:40] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:34:43] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:34:49] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2114 (T316186)', diff saved to https://phabricator.wikimedia.org/P33108 and previous config saved to /var/cache/conftool/dbconfig/20220825-123448-ladsgroup.json
[12:35:33] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.reboot-single for host db2114.codfw.wmnet
[12:38:24] <icinga-wm>	 PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[12:39:26] <hashar>	 jouncebot: now
[12:39:26] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 20 minute(s)
[12:39:45] <hashar>	 I am going to promote the rest of the wikis to 1.39.0-wmf.26
[12:40:28] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187)
[12:40:30] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot)
[12:40:48] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1001.eqiad.wmnet
[12:41:11] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.39.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826554 (https://phabricator.wikimedia.org/T314187) (owner: 10TrainBranchBot)
[12:44:20] <kart_>	 hashar: Can I go ahead with my wmf.26 backport patch as scheduled in approx 15 min?
[12:45:17] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.39.0-wmf.26  refs T314187
[12:45:21] <stashbot>	 T314187: 1.39.0-wmf.26 deployment blockers - https://phabricator.wikimedia.org/T314187
[12:46:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db2114.codfw.wmnet
[12:46:36] <icinga-wm>	 PROBLEM - mysqld processes on db2114 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:46:36] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s6 on db2114 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:46:36] <icinga-wm>	 PROBLEM - MariaDB Replica IO: s6 on db2114 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:46:38] <icinga-wm>	 PROBLEM - MariaDB read only s6 on db2114 is CRITICAL: Could not connect to localhost:3306 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:46:48] <marostegui>	 uh?
[12:46:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1003:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:46:59] <marostegui>	 that's the candidate master
[12:47:24] <marostegui>	 Amir1:  ^
[12:47:42] <Amir1>	 marostegui: rebooting it, it should come back online
[12:47:49] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[12:47:49] <Amir1>	 I personally downtimed it for a day
[12:48:06] <marostegui>	 Amir1: but the alert arrived?
[12:48:31] <Amir1>	 ah, didn't see it failed 
[12:48:32] <Amir1>	 (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host db2114.codfw.wmnet
[12:48:34] <Amir1>	 sigh
[12:48:43] <Amir1>	 how downtime fails :/
[12:48:45] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[12:48:47] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[12:48:47] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1001.eqiad.wmnet
[12:49:06] <Amir1>	 previous ones passed tho  (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[12:49:41] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1002.eqiad.wmnet
[12:49:41] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[12:50:21] <Amir1>	 sigh, why uptime has different value 
[12:50:28] <Amir1>	 anyway, separate issue
[12:51:04] <marostegui>	 Amir1: and mysql isn't up either
[12:51:08] <marostegui>	 is that expected?
[12:51:11] <Amir1>	 yeah, on it
[12:51:15] <marostegui>	 cool np
[12:51:40] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] sre: alert on appserver unavailability [alerts] - 10https://gerrit.wikimedia.org/r/825741 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi)
[12:51:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (4) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:52:44] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] Stop reporting releng images to debmonitor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff)
[12:53:42] <icinga-wm>	 RECOVERY - mysqld processes on db2114 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[12:53:42] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s6 on db2114 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:53:42] <icinga-wm>	 RECOVERY - MariaDB Replica IO: s6 on db2114 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:53:42] <icinga-wm>	 RECOVERY - MariaDB read only s6 on db2114 is OK: Version 10.4.25-MariaDB-log, Uptime 109s, read_only: True, event_scheduler: True, 1526.83 QPS, connection latency: 0.004922s, query latency: 0.000788s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[12:54:09] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kubestage: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826229 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[12:54:30] <Amir1>	 I think I know why it's erroring out, the cookbook removes downtime
[12:54:51] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] kubernetes: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826236 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[12:55:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] ml-staging: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826233 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[12:55:58] <wikibugs>	 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo) p:05Unbreak!→03High We believe this is solved now- RFO seemed to be an iss...
[12:56:21] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] ml-serve: cleanup profile::docker::storage [puppet] - 10https://gerrit.wikimedia.org/r/826238 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[12:56:58] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (4) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:57:30] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1002.eqiad.wmnet
[12:58:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2114', diff saved to https://phabricator.wikimedia.org/P33109 and previous config saved to /var/cache/conftool/dbconfig/20220825-125806-ladsgroup.json
[12:58:07] <hashar>	 marostegui: Amir1: has MediaWiki overloaded that s6 db2114 database or is that unrelated?
[12:58:23] <marostegui>	 hashar: Unrelated
[12:58:25] <Amir1>	 db2114 is codfw, not getting any traffic
[12:58:28] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (2) rsyslog on dse-k8s-worker1004:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:58:35] <hashar>	 great thank you for the confirmation
[12:58:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (5) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[12:58:59] <Amir1>	 marostegui: let me try another thing for the next restart, is that fine with you?
[13:00:05] <jouncebot>	 Deploy window Mobileapps/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1300)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, and awight: My dear minions, it's time we take the moon! Just kidding. Time for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1300).
[13:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:19] * kart_ is here
[13:00:39] <kart_>	 hashar: Can I go ahead for backport deployment?
[13:00:42] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[13:00:44] <logmsgbot>	 !log ladsgroup@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db2114.codfw.wmnet with reason: Maintenance
[13:01:44] <wikibugs>	 (03PS1) 10Ayounsi: Inital FHRP support [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/826559 (https://phabricator.wikimedia.org/T311218)
[13:02:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (6) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:02:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[13:02:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2117.codfw.wmnet with reason: Maintenance
[13:02:36] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2117 (T316186)', diff saved to https://phabricator.wikimedia.org/P33110 and previous config saved to /var/cache/conftool/dbconfig/20220825-130235-ladsgroup.json
[13:02:52] <kart_>	 hashar: ping ping :)
[13:07:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (7) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:08:26] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "Great, thanks for doing this!" [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[13:08:52] <wikibugs>	 (03PS1) 10Ayounsi: Add FHRP group support to generate_dns_snippets [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/826560 (https://phabricator.wikimedia.org/T311218)
[13:09:18] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911)
[13:09:41] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1003 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:09:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2117', diff saved to https://phabricator.wikimedia.org/P33111 and previous config saved to /var/cache/conftool/dbconfig/20220825-130950-ladsgroup.json
[13:10:45] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1007 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:10:47] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Commons, 10Wikimedia-Incident, 10Wikimedia-production-error: Ongoing media storage errors: backend-fail-internal on deletions and ~2000 read errors/s - https://phabricator.wikimedia.org/T316188 (10jcrespo)
[13:11:39] <wikibugs>	 (03PS3) 10Vgutierrez: trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911)
[13:11:51] <hashar>	 kart_: yeah sorry
[13:12:10] <hashar>	 was digging in grafana and logs
[13:12:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:12:38] <hashar>	 I am trying the `scap backport` command
[13:12:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved via scap backport" [extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986) (owner: 10KartikMistry)
[13:12:45] <kart_>	 hashar: OK. Going ahead. Will take 15 min to merge anyway..
[13:12:52] <hashar>	 ah
[13:13:00] <hashar>	 I should have +2ed it ahead of time
[13:13:07] <hashar>	 and really should speed up those CI jobs
[13:13:27] <hashar>	 I found a potential optimization to bring the selenium one from ~15 to 10 minutes, which would help
[13:14:19] <kart_>	 cool. I see the patch is merged via scap backport?
[13:14:24] <kart_>	 being merged..
[13:14:26] <wikibugs>	 (03CR) 10Ayounsi: [C: 03+1] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[13:14:26] <hashar>	 that `scap backport` is quite great. It found the patch from the Deployments page, found that the change is open, and +2ed it
[13:14:41] <hashar>	 now it waits for the merge to happen
[13:14:56] <hashar>	 13:12:45 Waiting for changes to be merged. This may take some time if there are long running tests.
[13:14:56] <hashar>	 Change 826341 status: NEW, mergeable: True
[13:14:56] <hashar>	 Change 826341 status: NEW, mergeable: True
[13:15:12] <kart_>	 and will it do all the magic? or do I need to do a normal scap run?
[13:16:01] <jayme>	 btullis: kubelet not starting is probably a cgroup issue (with bullseye only mounting cgroup v2)
[13:17:16] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[13:17:23] <jayme>	 btullis: yep...there was a manual change needed (https://phabricator.wikimedia.org/T300744#7700797)
[13:17:29] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2124.codfw.wmnet with reason: Maintenance
[13:17:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33112 and previous config saved to /var/cache/conftool/dbconfig/20220825-131735-ladsgroup.json
[13:18:07] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[13:18:16] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (NOOP 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36981/console" [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[13:18:46] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Disable origin coalescing [puppet] - 10https://gerrit.wikimedia.org/r/825727 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[13:19:50] <vgutierrez>	 !log disable origin coalescing in ats-be globally - T315911
[13:19:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:19:54] <stashbot>	 T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911
[13:20:06] <hashar>	 kart_: i think it does all the magic yes
[13:20:21] <hashar>	 the idea releng has is to make the deployment as automated as possible
[13:20:34] <hashar>	 so that in theory anyone can process the deployments with just a few lines of documentation
[13:21:55] <kart_>	 hashar: can you point me to the scap backport documentation?
[13:22:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:23:03] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1004 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:23:16] <hashar>	 kart_: I don't think it is documented yet
[13:23:28] <jinxer-wm>	 (KubernetesRsyslogDown) resolved: (3) rsyslog on dse-k8s-worker1005:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:23:30] <kart_>	 ah.
[13:23:40] <hashar>	 https://doc.wikimedia.org/scap/search.html?q=backport gives nothing and the wiki doc at https://wikitech.wikimedia.org/wiki/Backport_windows/Deployers does not mention it yet
[13:23:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33113 and previous config saved to /var/cache/conftool/dbconfig/20220825-132356-ladsgroup.json
[13:24:29] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1008 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:24:53] <hashar>	 kart_: I have asked in our team channel. I am guessing it is not yet ready for widespread adoption
[13:25:13] <hashar>	 I reviewed a patch to it yesterday
[13:26:18] <kart_>	 hashar: I hope it won't break anything :D
[13:28:58] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[13:30:37] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan)
[13:31:39] <wikibugs>	 (03Merged) 10jenkins-bot: CX3 Build 0.2.0+20220825 [extensions/ContentTranslation] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826341 (https://phabricator.wikimedia.org/T309986) (owner: 10KartikMistry)
[13:32:13] <logmsgbot>	 !log hashar@deploy1002 Started scap: Backport for [[gerrit:826341|CX3 Build 0.2.0+20220825 (T309986 T301222)]]
[13:32:18] <stashbot>	 T309986: Persist selection of translation service across sessions - https://phabricator.wikimedia.org/T309986
[13:32:18] <stashbot>	 T301222: Instrumentation of new SX entrypoints - https://phabricator.wikimedia.org/T301222
[13:33:04] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1005 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:33:16] <hashar>	 hmm
[13:33:19] <hashar>	 kart_: looks like it works
[13:33:29] <wikibugs>	 (03PS1) 10Joal: Add linktarget to sqooped tables [puppet] - 10https://gerrit.wikimedia.org/r/826564 (https://phabricator.wikimedia.org/T314666)
[13:34:55] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: increase num_workers to 4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/820117 (https://phabricator.wikimedia.org/T300914) (owner: 10Hnowlan)
[13:35:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[13:35:44] <hashar>	 kart_: looks like the `scap backport` script runs a full sync directly, bypassing the manual verification steps through `mwdebug*` hosts
[13:35:53] <kart_>	 ah.
[13:36:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[13:36:12] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[13:36:24] <kart_>	 hashar: That's fine. Patch is tested in master already.
[13:36:46] <kart_>	 But, would love to see mwdebug* deploy first.
[13:37:00] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[13:38:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[13:38:44] <icinga-wm>	 RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[13:39:03] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P33114 and previous config saved to /var/cache/conftool/dbconfig/20220825-133902-ladsgroup.json
[13:39:18] <wikibugs>	 (03Merged) 10jenkins-bot: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826525 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[13:39:38] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1003.eqiad.wmnet
[13:40:16] <hashar>	 hmm
[13:40:25] <hashar>	 Changes synced to: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet.
[13:40:25] <hashar>	 Please do any necessary checks before continuing.
[13:40:28] <hashar>	 kart_: I was wrong ;)
[13:41:25] <hashar>	 so you can test on mwdebug hosts or I can answer `Y` to do the full deployment
[13:41:41] <hashar>	 (sorry I am learning about that command)
[13:42:00] <wikibugs>	 10SRE, 10Analytics-Radar, 10Machine-Learning-Team: Using docker in WMF production network outside of kubernetes - https://phabricator.wikimedia.org/T275551 (10akosiaris) >>! In T275551#8176053, @fkaelin wrote: > Reviving this discussion, though I renamed the phab to "Running docker containers in a non-produc...
[13:42:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1120.eqiad.wmnet with reason: Maintenance
[13:43:13] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1120.eqiad.wmnet with reason: Maintenance
[13:43:19] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33115 and previous config saved to /var/cache/conftool/dbconfig/20220825-134318-ladsgroup.json
[13:43:23] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[13:43:45] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Krinkle)
[13:43:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use  FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10ayounsi) The two patches above should allow us to use the `FHRP group` feature in production, without leveraging additional fields like priority or...
[13:44:00] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Krinkle)
[13:44:23] <hashar>	 kart_: I am syncing it
[13:45:10] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-worker1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:45:22] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1003.eqiad.wmnet
[13:47:54] <icinga-wm>	 PROBLEM - Check systemd state on dse-k8s-worker1006 is CRITICAL: CRITICAL - degraded: The following units failed: kubelet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:48:31] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) Instead let's move these to a baremetal host instead? We're hitting some limits of what makes sense with Ganeti for these, one other issue is high rate...
[13:49:26] <wikibugs>	 (03PS4) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911)
[13:49:55] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10MoritzMuehlenhoff) That would also be a fine opportunity to move away from the confusing naming scheme, given that webperf1003 and 1004 are totally different services, so...
[13:52:07] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36982/console" [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[13:54:09] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124', diff saved to https://phabricator.wikimedia.org/P33116 and previous config saved to /var/cache/conftool/dbconfig/20220825-135408-ladsgroup.json
[13:56:14] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "Thanks for the fixes!" [puppet] - 10https://gerrit.wikimedia.org/r/826490 (owner: 10Andrea Denisse)
[13:57:10] <logmsgbot>	 !log hashar@deploy1002 Finished scap: Backport for [[gerrit:826341|CX3 Build 0.2.0+20220825 (T309986 T301222)]] (duration: 24m 56s)
[13:57:12] <wikibugs>	 (03PS1) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196)
[13:57:19] <stashbot>	 T309986: Persist selection of translation service across sessions - https://phabricator.wikimedia.org/T309986
[13:57:19] <stashbot>	 T301222: Instrumentation of new SX entrypoints - https://phabricator.wikimedia.org/T301222
[13:57:51] <kart_>	 hashar: Thanks!
[13:58:21] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: alerts to use yearly rotation [puppet] - 10https://gerrit.wikimedia.org/r/826385 (https://phabricator.wikimedia.org/T304924) (owner: 10Cwhite)
[13:58:33] <wikibugs>	 (03PS2) 10Hnowlan: changeprop: make num_workers configurable for jobqueue [deployment-charts] - 10https://gerrit.wikimedia.org/r/826570 (https://phabricator.wikimedia.org/T233196)
[13:59:20] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: set ecs routing only when the output is logstash [puppet] - 10https://gerrit.wikimedia.org/r/826384 (https://phabricator.wikimedia.org/T305013) (owner: 10Cwhite)
[14:00:25] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: add tcpircbot logging tests [puppet] - 10https://gerrit.wikimedia.org/r/824317 (https://phabricator.wikimedia.org/T257861) (owner: 10Cwhite)
[14:01:57] <wikibugs>	 (03PS5) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911)
[14:02:48] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: use rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824316 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite)
[14:02:54] <wikibugs>	 (03PS1) 10Btullis: Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174)
[14:04:54] <wikibugs>	 (03PS1) 10Milimetric: Add datahub lineage plugin to the build [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/826573
[14:05:41] <wikibugs>	 (03CR) 10Herron: [C: 03+1] rsyslog: add rsyslog-namespaced fields to syslog_cee [puppet] - 10https://gerrit.wikimedia.org/r/824315 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite)
[14:06:12] <wikibugs>	 (03CR) 10Milimetric: "Adding the latest version of this plugin.  It should be forwards-compatible, so hopefully doesn't need lots of updating.  But we may want " [debs/airflow] (debian) - 10https://gerrit.wikimedia.org/r/826573 (owner: 10Milimetric)
[14:06:25] <wikibugs>	 (03PS6) 10Vgutierrez: trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911)
[14:06:31] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:07:33] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host people2002.codfw.wmnet
[14:07:51] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:08:07] <hashar>	 kart_: you are welcome, and sorry for the delay
[14:08:11] <wikibugs>	 (03CR) 10Ayounsi: Add btullis to users to allow for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[14:09:12] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Fix open_write_fail_action values [puppet] - 10https://gerrit.wikimedia.org/r/825709 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:09:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2124 (T316186)', diff saved to https://phabricator.wikimedia.org/P33117 and previous config saved to /var/cache/conftool/dbconfig/20220825-140915-ladsgroup.json
[14:11:00] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+1] Varnish: Stop sending analytics cookies to API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall)
[14:11:15] <wikibugs>	 (03CR) 10Herron: [C: 03+1] logstash: add support for rsyslog-namespaced fields [puppet] - 10https://gerrit.wikimedia.org/r/824314 (https://phabricator.wikimedia.org/T315500) (owner: 10Cwhite)
[14:11:28] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2002.codfw.wmnet
[14:11:33] <wikibugs>	 (03CR) 10Btullis: Add btullis to users to allow for router configuration (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174) (owner: 10Btullis)
[14:13:24] <claime>	 !log rebooting people1003 (people.wikimedia.org)
[14:13:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:13:35] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-single for host people1003.eqiad.wmnet
[14:15:58] <claime>	 !log finished rebooting people1003 (people.wikimedia.org)
[14:16:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:17:28] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people1003.eqiad.wmnet
[14:20:08] <wikibugs>	 (03PS1) 10Btullis: Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344
[14:20:47] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-worker1004.eqiad.wmnet
[14:21:09] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344 (owner: 10Btullis)
[14:21:52] <wikibugs>	 (03Merged) 10jenkins-bot: Revert "Add BGP neighbor data for the new dse-k8s cluster" [homer/public] - 10https://gerrit.wikimedia.org/r/826344 (owner: 10Btullis)
[14:23:02] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911)
[14:24:10] <icinga-wm>	 RECOVERY - Check systemd state on dse-k8s-worker1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:24:14] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/36983/console" [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:24:45] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] Remove zookeeper_version [puppet] - 10https://gerrit.wikimedia.org/r/818863 (https://phabricator.wikimedia.org/T312539) (owner: 10Muehlenhoff)
[14:24:49] <wikibugs>	 (03PS1) 10FNegri: Add cloudcephosd1029 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870)
[14:28:36] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-worker1004.eqiad.wmnet
[14:29:06] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:29:14] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:29:28] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance
[14:30:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578
[14:30:51] <wikibugs>	 (03PS2) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578
[14:31:01] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] trafficserver: Enable origin coalescing in cp600[78] [puppet] - 10https://gerrit.wikimedia.org/r/826576 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[14:32:22] <vgutierrez>	 !log enable origin coalescing in ats-be@cp600[78] [expect crashes] - T315911
[14:32:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:27] <stashbot>	 T315911: ATS Read While Writer feature is wrongly configured - https://phabricator.wikimedia.org/T315911
[14:32:31] <vgutierrez>	 gotta love my optimism
[14:34:00] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578 (owner: 10Muehlenhoff)
[14:34:14] <sukhe>	 :P
[14:35:21] <wikibugs>	 (03PS1) 10Btullis: Add BGP neighbor data for the new dse-k8s cluster [homer/public] - 10https://gerrit.wikimedia.org/r/826579 (https://phabricator.wikimedia.org/T310174)
[14:35:40] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[14:35:45] <stashbot>	 T311686: Upgrade ganeti/codfw to Bullseye - https://phabricator.wikimedia.org/T311686
[14:35:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5 days, 0:00:00 on ganeti2025.codfw.wmnet with reason: Remove node for eventual reimage, T311686
[14:36:09] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel
[14:36:22] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New Kernel
[14:37:04] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[14:42:16] <wikibugs>	 (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/803520 (owner: 10Hnowlan)
[14:42:26] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel
[14:42:39] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New Kernel
[14:43:54] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel
[14:44:08] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New Kernel
[14:44:55] <wikibugs>	 (03PS3) 10Muehlenhoff: Cookbooks to perform rolling restart/reboot of an LDAP replica cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/826578
[14:46:39] <cbogen>	 hi mutante: Andrew Otto suggested I reach out to you to see if you could help us get this patch merged https://gerrit.wikimedia.org/r/c/operations/puppet/+/811312
[14:47:36] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] C:profile::docker::storage removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[14:49:06] <wikibugs>	 (03PS1) 10Hnowlan: Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104)
[14:49:39] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) Dell technician will be on site today between 10am CT and 2pm. Is it possible to get this server offline for the backplane replacement?  Thanks
[14:51:40] <wikibugs>	 (03CR) 10FNegri: Add cloudcephosd1029 to the Ceph pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[14:51:50] <wikibugs>	 (03CR) 10FNegri: [C: 03+2] Add cloudcephosd1029 to the Ceph pool [puppet] - 10https://gerrit.wikimedia.org/r/826577 (https://phabricator.wikimedia.org/T314870) (owner: 10FNegri)
[14:52:36] <wikibugs>	 10SRE, 10Image-Suggestions, 10Patch-For-Review, 10Structured-Data-Backlog (Current Work): [M] Schedule image suggestions notifications - https://phabricator.wikimedia.org/T300024 (10CBogen) Tagging #sre in hopes that someone on clinic duty can help us get this patch merged, thanks!
[14:53:13] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[14:54:44] <wikibugs>	 (03Merged) 10jenkins-bot: Add blubber config file [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/826585 (https://phabricator.wikimedia.org/T312104) (owner: 10Hnowlan)
[14:56:45] <wikibugs>	 (03PS1) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911)
[14:57:32] <wikibugs>	 (03PS8) 10BCornwall: varnish: Stop sending analytics cookies to API [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943)
[14:59:54] <wikibugs>	 (03PS1) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703)
[15:00:36] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ottomata) Approved.
[15:01:04] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup)
[15:01:12] <wikibugs>	 (03PS3) 10Ladsgroup: Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140)
[15:01:19] <wikibugs>	 (03PS2) 10Btullis: Add btullis to users to allow for router configuration [homer/public] - 10https://gerrit.wikimedia.org/r/826572 (https://phabricator.wikimedia.org/T310174)
[15:01:21] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Admin: Add Amanda Bittaker to analytics-private-data [puppet] - 10https://gerrit.wikimedia.org/r/826516 (https://phabricator.wikimedia.org/T316140) (owner: 10Ladsgroup)
[15:03:48] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse)
[15:03:54] <wikibugs>	 (03PS2) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911)
[15:04:15] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ladsgroup) 05Open→03Resolved You should be able to access it in half an hour or so. If not, please reopen this ticket. Thank you for flying with Wikimedia SRE.
[15:07:16] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] "Moritz and I talked about it this morning, then we had a Swift outage and I was dealing with the MediaWiki train. It is a bit late to get " [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar)
[15:09:23] <wikibugs>	 (03PS8) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[15:10:25] <wikibugs>	 10SRE, 10ops-codfw, 10Discovery-Search: elastic2054 is down with memory error - https://phabricator.wikimedia.org/T315989 (10Papaul) 05Open→03Resolved memory replaced, system is back online.
[15:13:38] <icinga-wm>	 RECOVERY - ElasticSearch numbers of masters eligible - 9643 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Expected_eligible_masters_check_and_alert
[15:15:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:16:17] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] C:profile::docker::storage removal and cleanup [puppet] - 10https://gerrit.wikimedia.org/r/826245 (https://phabricator.wikimedia.org/T315977) (owner: 10Clément Goubert)
[15:17:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33118 and previous config saved to /var/cache/conftool/dbconfig/20220825-151731-ladsgroup.json
[15:17:36] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Dzahn) +1 to not using the same names for the different webperf roles, thought the same before, should match more the puppet role
[15:17:37] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[15:18:48] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219)
[15:18:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10vm-requests: Resize webperf1004/2004 VM for arc-lamp - https://phabricator.wikimedia.org/T316223 (10Dzahn) And yea, like the history says the discussion was to start from scratch once we get over the 16GB RAM limit. Hardware sounds the right way indeed.
[15:19:47] <wikibugs>	 (03PS9) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[15:22:44] <icinga-wm>	 PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[15:22:44] <wikibugs>	 (03Abandoned) 10BCornwall: admin: Add SSH key to mraish user [puppet] - 10https://gerrit.wikimedia.org/r/818397 (https://phabricator.wikimedia.org/T313429) (owner: 10Vgutierrez)
[15:23:20] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Grant private data access to Amanda Bittaker - https://phabricator.wikimedia.org/T316140 (10Ottomata) (^ lol)
[15:23:43] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[15:23:46] <wikibugs>	 (03CR) 10BCornwall: varnish: Stop sending analytics cookies to API (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/824793 (https://phabricator.wikimedia.org/T260943) (owner: 10BCornwall)
[15:23:56] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2158.codfw.wmnet with reason: Maintenance
[15:23:58] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[15:24:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2095.codfw.wmnet with reason: Maintenance
[15:24:17] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33119 and previous config saved to /var/cache/conftool/dbconfig/20220825-152417-ladsgroup.json
[15:26:46] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[15:27:07] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 20s)
[15:27:22] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[15:27:50] <wikibugs>	 (03CR) 10Dzahn: "yep, all sounds good to me. back to this on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/826237 (https://phabricator.wikimedia.org/T315942) (owner: 10Hashar)
[15:29:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T316186)', diff saved to https://phabricator.wikimedia.org/P33120 and previous config saved to /var/cache/conftool/dbconfig/20220825-152932-ladsgroup.json
[15:30:41] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Auth extremely slow on clouddumps100[12] - https://phabricator.wikimedia.org/T316123 (10Andrew) + Moritz because I think he had a patch in the works.  If not let me know and I can likely figure it out :)
[15:31:37] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[15:31:47] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[15:31:47] <icinga-wm>	 PROBLEM - Host ores2009.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[15:32:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P33121 and previous config saved to /var/cache/conftool/dbconfig/20220825-153237-ladsgroup.json
[15:33:00] <wikibugs>	 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (10Andrew) a:05Andrew→03cmooney This additional range was set up by @cmooney -- Cathal, is this something you can document as needed?
[15:39:23] <wikibugs>	 (03PS2) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703)
[15:41:47] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[15:41:57] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[15:42:45] <jynus>	 !log restart backup1002 (interrupted before), backup1003, backup2003
[15:42:48] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:39] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P33122 and previous config saved to /var/cache/conftool/dbconfig/20220825-154438-ladsgroup.json
[15:47:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120', diff saved to https://phabricator.wikimedia.org/P33123 and previous config saved to /var/cache/conftool/dbconfig/20220825-154743-ladsgroup.json
[15:50:14] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[15:50:23] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[15:52:07] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[15:52:16] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[15:54:02] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P33124 and previous config saved to /var/cache/conftool/dbconfig/20220825-155401-ladsgroup.json
[15:54:34] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[15:54:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[15:55:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2169.codfw.wmnet with reason: Maintenance
[15:55:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33125 and previous config saved to /var/cache/conftool/dbconfig/20220825-155506-ladsgroup.json
[15:55:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33126 and previous config saved to /var/cache/conftool/dbconfig/20220825-155529-ladsgroup.json
[15:57:07] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse)
[16:00:05] <jouncebot>	 jbond and rzl: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:31] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[16:00:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33127 and previous config saved to /var/cache/conftool/dbconfig/20220825-160036-ladsgroup.json
[16:00:40] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[16:00:52] <wikibugs>	 (03Abandoned) 10Andrew Bogott: OpenStack nova.conf: set reclaim_instance_interval to half an hour [puppet] - 10https://gerrit.wikimedia.org/r/798772 (owner: 10Andrew Bogott)
[16:01:13] <wikibugs>	 (03PS1) 10Milimetric: airflow: disable lazy loading plugins [puppet] - 10https://gerrit.wikimedia.org/r/826600
[16:01:22] <wikibugs>	 (03PS1) 10Ori: Increase roll-out of query-sorting to 15% [puppet] - 10https://gerrit.wikimedia.org/r/826601 (https://phabricator.wikimedia.org/T314868)
[16:02:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1120 (T312160)', diff saved to https://phabricator.wikimedia.org/P33128 and previous config saved to /var/cache/conftool/dbconfig/20220825-160250-ladsgroup.json
[16:02:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[16:02:54] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance
[16:02:55] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[16:04:27] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T314509 (10Papaul) p:05Triage→03Medium
[16:04:47] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: thanos-be2002 sdj failed - https://phabricator.wikimedia.org/T314913 (10Papaul) p:05Triage→03Medium
[16:07:22] <wikibugs>	 (03Merged) 10jenkins-bot: Set expiry headers on thumbnails [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/489022 (https://phabricator.wikimedia.org/T211661) (owner: 10Gilles)
[16:07:23] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[16:07:32] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[16:07:36] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Good afternoon Papaul,  I have submitted DPS 432866984 for the replacement backplane to ship out. Service is scheduled for Thursday 08/25/22. The tech w...
[16:08:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Papaul) Dell technician will be on site today between 10am CT and 2pm. Is it possible to get this server offline for the backplane replacement?  Thanks
[16:14:10] <wikibugs>	 (03PS1) 10Hashar: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604
[16:15:05] <wikibugs>	 (03PS3) 10DCausse: wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703)
[16:15:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P33129 and previous config saved to /var/cache/conftool/dbconfig/20220825-161544-ladsgroup.json
[16:18:21] <wikibugs>	 (03PS3) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541)
[16:18:23] <wikibugs>	 (03PS2) 10Hashar: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604
[16:19:28] <wikibugs>	 (03CR) 10DCausse: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse)
[16:19:30] <wikibugs>	 (03CR) 10Hashar: "Daniel,  got the documentation from your change introducing httpbb tests for doc.wikimedia.org  415616c37394d300700a6810797760e53aa702b3" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:19:38] <wikibugs>	 (03PS2) 10Milimetric: airflow: disable lazy plugins and add datahub conn [puppet] - 10https://gerrit.wikimedia.org/r/826600
[16:21:14] <wikibugs>	 (03CR) 10Hashar: "The back compatibility Apache redirects got broken at some point in the past. This convert them to Rewrite rules which I have tested local" [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar)
[16:23:00] <wikibugs>	 (03PS1) 10Dzahn: Revert "Revert "c:spamassassin move Spamassassin updates from crontab"" [puppet] - 10https://gerrit.wikimedia.org/r/826607
[16:23:56] <wikibugs>	 (03CR) 10Dzahn: "@AOkoth Could you maybe take this and see if you can reproduce and catch the error we saw yesterday?" [puppet] - 10https://gerrit.wikimedia.org/r/826607 (owner: 10Dzahn)
[16:24:03] <icinga-wm>	 RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[16:27:34] <wikibugs>	 (03CR) 10Dzahn: "yep, confirmed it works that way. the only problem is of course the part that tests come after deployment." [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:28:08] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:28:21] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:28:54] <hashar>	 mutante: I wanted to explore how to provision an apache from puppet and run httpbb against that but gave up. It is a long tail of complexity :)
[16:29:18] <hashar>	 I guess one way is to deploy the httpbb tests on the deployment server and the target host then run the tests manually
[16:29:23] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "one thing though. if the tests are not changed and succeed both before and after the redirect change.. then aren't they missing tests to t" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:29:46] <wikibugs>	 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, and 2 others: Migrate thumbor to Kubernetes - https://phabricator.wikimedia.org/T233196 (10hnowlan)
[16:30:13] <wikibugs>	 (03PS1) 10Jdrewniak: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134)
[16:30:43] <wikibugs>	 (03PS3) 10Ebernhardson: query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899)
[16:30:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316', diff saved to https://phabricator.wikimedia.org/P33130 and previous config saved to /var/cache/conftool/dbconfig/20220825-163050-ladsgroup.json
[16:32:36] <wikibugs>	 (03PS3) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911)
[16:32:44] <wikibugs>	 (03CR) 10Bking: [C: 03+2] wcqs: enable proper URI schemes for commons [puppet] - 10https://gerrit.wikimedia.org/r/826589 (https://phabricator.wikimedia.org/T314703) (owner: 10DCausse)
[16:33:38] <mutante>	 hashar: for doc specifically, we had a test setup in devtools but gave up on it afair
[16:34:04] <mutante>	 hashar: for mw appservers the way we do it is to disable puppet on mw*, enable it only on mwdebug, run puppet, run tests.. if we like it..enable puppet on all
[16:34:10] <hashar>	 yeah I built that when I split the published artifacts to their own dir (`/srv/doc` iirc)
[16:34:15] <mutante>	 of course we don't have docdebug
[16:35:15] <mutante>	 hashar: i think the realistic way to test is to disable puppet on doc1002, run puppet on doc2001, run test against doc2001, enable on both
[16:35:43] <mutante>	 hashar: but re: setup apache in cloud VPS, I made the role simplelamp2 for that, just apply it and it sets up apache
[16:35:49] <hashar>	 possibly yes. I guess we will find out next time we have a big Apache configuration change to make
[16:36:51] <mutante>	 so you have a change that fixes something, but if the tests work already before the fix..maybe it is missing a test for something
[16:37:09] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: backplane replacement
[16:37:23] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ms-be2067.codfw.wmnet with reason: backplane replacement
[16:37:30] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c4a39dbe-2fb0-4745-99c3-76e40de3820e) set by eevans@cumin1001 for 1 day, 0:00:00 on 1 host(s) a...
[16:37:56] <wikibugs>	 (03CR) 10Hashar: doc: document how to run httpbb tests (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:38:14] <hashar>	 mutante: yeah I get your point, then the redirects are currently broken ;)
[16:38:30] <hashar>	 I could theoretically write a test which shows they give a 404
[16:38:47] <hashar>	 then amend the current changes which would replace the 404 tests by 302 ones
[16:39:08] <hashar>	 but I don't think that adds any value in this case
[16:39:36] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "my point was just to add tests for the "compat URLs" because I notice you say they are broken but all the tests succeed" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:40:03] <urandom>	 !log shutting down ms-be2067.codfw.wmnet for backplane replacement -- T314049
[16:40:08] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:40:09] <stashbot>	 T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049
[16:40:48] <wikibugs>	 (03PS10) 10David Caro: ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870)
[16:41:11] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops, 10Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (10Eevans) @Papaul the host is shut down; Please let me know as soon as it's back up
[16:42:21] <wikibugs>	 (03PS3) 10Dzahn: doc: document how to run httpbb tests [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:42:44] <wikibugs>	 (03PS2) 10Bernard Wang: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak)
[16:42:54] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "rebased, merging, comments-only" [puppet] - 10https://gerrit.wikimedia.org/r/826604 (owner: 10Hashar)
[16:44:20] <hashar>	 I think I broke the existing `Redirect` when introducing the `RewriteRule`
[16:44:29] <hashar>	 apache is full of surprises
[16:45:50] <ebernhardson>	 it's frankly amazing how hard it is to properly configure most http servers :) nginx is sadly almost as bad as apache ...
[16:45:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3316 (T316186)', diff saved to https://phabricator.wikimedia.org/P33131 and previous config saved to /var/cache/conftool/dbconfig/20220825-164556-ladsgroup.json
[16:46:22] <hashar>	 the good news is that we have Apache hackers at the wmf :-]
[16:47:31] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "I had it and still failed to save it, will reproduce it." [puppet] - 10https://gerrit.wikimedia.org/r/826331 (owner: 10Dzahn)
[16:48:33] <wikibugs>	 (03CR) 10Bking: [C: 03+2] query_service: Avoid passing content body to internal auth endpoints [puppet] - 10https://gerrit.wikimedia.org/r/825925 (https://phabricator.wikimedia.org/T306899) (owner: 10Ebernhardson)
[16:49:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ceph.bootstrap_and_add: add support to change the osd class type [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/824153 (https://phabricator.wikimedia.org/T314870) (owner: 10David Caro)
[16:49:53] <wikibugs>	 (03CR) 10Dzahn: "@cmooney there is a follow-up at https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542" [puppet] - 10https://gerrit.wikimedia.org/r/824495 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar)
[16:51:57] <wikibugs>	 (03CR) 10Dzahn: "ok, thank you. I will comment here if the alert comes back ever." [puppet] - 10https://gerrit.wikimedia.org/r/826513 (owner: 10Slyngshede)
[16:52:12] <wikibugs>	 (03Abandoned) 10Dzahn: Revert "Revert "c:spamassassin move Spamassassin updates from crontab"" [puppet] - 10https://gerrit.wikimedia.org/r/826607 (owner: 10Dzahn)
[16:52:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33132 and previous config saved to /var/cache/conftool/dbconfig/20220825-165213-ladsgroup.json
[16:52:28] <wikibugs>	 (03PS4) 10Hashar: doc: properly redirect back compat URLs [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541)
[16:53:17] <wikibugs>	 (03CR) 10Hashar: "rebased since the child change got cherry picked and merged and ended up causing a conflict." [puppet] - 10https://gerrit.wikimedia.org/r/824542 (https://phabricator.wikimedia.org/T315541) (owner: 10Hashar)
[16:54:03] <wikibugs>	 (03PS1) 10BryanDavis: developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627
[17:00:04] <jouncebot>	 bd808: May I have your attention please! Technical Engagement weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1700)
[17:03:59] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: ATS should alert if the number of total or active connections reached maximum - https://phabricator.wikimedia.org/T292815 (10BCornwall) Change of plans: Kwaku has expressed an interest in backwards-compatibility so ATS 8 support will be added.
[17:04:01] <hashar>	  mutante I rebased the apache redirect patch since it ended up conflicting ;)
[17:04:16] <wikibugs>	 (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627 (owner: 10BryanDavis)
[17:04:30] <wikibugs>	 (03PS3) 10Ryan Kemper: opensearch: replace outdated config [puppet] - 10https://gerrit.wikimedia.org/r/826383 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[17:07:20] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P33133 and previous config saved to /var/cache/conftool/dbconfig/20220825-170719-ladsgroup.json
[17:07:30] <wikibugs>	 (03Merged) 10jenkins-bot: developer-portal: Bump container to 2022-08-23-080429-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/826627 (owner: 10BryanDavis)
[17:08:50] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[17:09:15] <logmsgbot>	 !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[17:09:24] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[17:10:03] <logmsgbot>	 !log bd808@deploy1002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[17:10:11] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[17:10:58] <logmsgbot>	 !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[17:21:56] <wikibugs>	 (03PS4) 10Vgutierrez: trafficserver: Get rid of disable_coalescing() in Lua [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911)
[17:22:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P33135 and previous config saved to /var/cache/conftool/dbconfig/20220825-172225-ladsgroup.json
[17:29:11] <wikibugs>	 (03CR) 10Vgutierrez: "Tested cookie hiding for caching purposes in our WMCS environment, works as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/826586 (https://phabricator.wikimedia.org/T315911) (owner: 10Vgutierrez)
[17:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[17:29:47] <wikibugs>	 (03PS1) 10Bking: deployment-prep: remove defunct elastic hosts [puppet] - 10https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240)
[17:36:31] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: don't start es7 unit until we tell it [puppet] - 10https://gerrit.wikimedia.org/r/826396 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper)
[17:37:32] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T316186)', diff saved to https://phabricator.wikimedia.org/P33136 and previous config saved to /var/cache/conftool/dbconfig/20220825-173731-ladsgroup.json
[17:38:20] <wikibugs>	 (03CR) 10Hashar: "We could surely use some monitoring for the releng images. Probably not by failing the unit, but some kind of weekly report by email or si" [puppet] - 10https://gerrit.wikimedia.org/r/826211 (owner: 10Muehlenhoff)
[17:38:57] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:39:04] <hashar>	 mutante: would you merge the doc redirect fix up https://gerrit.wikimedia.org/r/c/operations/puppet/+/824542 ? the other comments-only change got merged so I thought you would deploy the fix as well :)
[17:39:11] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[17:41:24] <mutante>	 hashar: no, I was not going to merge that right now based on the history with doc redirects and the tests thing, I added reviewers and person who merged the last change though
[17:42:05] <mutante>	 I merged the other thing because it was comments only and confirmed the docs 
[17:43:23] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: don't start es 7 until ready [cookbooks] - 10https://gerrit.wikimedia.org/r/826397 (owner: 10Ryan Kemper)
[17:44:02] <hashar>	 well that previous change got blindly merged as part of clinic duty
[17:44:07] <wikibugs>	 (03CR) 10ArielGlenn: "Hannah and I looked at this, seems good to me, merge at will." [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah)
[17:44:09] <hashar>	 but well guess that can wait ;)
[17:44:12] <mutante>	 maybe that was the issue then
[17:44:27] <wikibugs>	 (03CR) 10Majavah: [V: 03+1] P:dumps: remove ipv4/ipv6 separation from internal_rsync_clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/825685 (owner: 10Majavah)
[17:44:35] <mutante>	 clinic duty does not even include merging puppet changes
[17:44:55] <hashar>	 well that is how I get those puppet patches merged most of the time 
[17:45:12] <hashar>	 anyway it is not an urgent patch
[17:45:14] <mutante>	 I would prefer if we could change that
[17:45:27] <mutante>	 ok, great
[17:46:47] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "looks fine to me" [puppet] - 10https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: 10Ahmon Dancy)
[17:47:21] <mutante>	 I have some other things going on but it won't be forgotten, it's in the queue
[17:47:30] <dancy>	 👍🏾 Thanks Daniel
[17:47:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2115.codfw.wmnet with reason: Maintenance
[17:48:09] <dancy>	 oh.. wrong channel.  thanks anyway. :-)
[17:48:20] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2115.codfw.wmnet with reason: Maintenance
[17:48:26] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33137 and previous config saved to /var/cache/conftool/dbconfig/20220825-174826-ladsgroup.json
[17:48:32] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[17:49:27] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[17:49:34] <hashar>	 mutante: no worries :-]
[17:49:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1131.eqiad.wmnet with reason: Maintenance
[17:49:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33138 and previous config saved to /var/cache/conftool/dbconfig/20220825-174946-ladsgroup.json
[17:54:55] <icinga-wm>	 PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:57:15] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33139 and previous config saved to /var/cache/conftool/dbconfig/20220825-175715-ladsgroup.json
[18:00:04] <jouncebot>	 hashar and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T1800).
[18:01:30] <dancy>	 I am going to use the train window to deploy a new version of scap
[18:04:03] <MatmaRex>	 unrelatedly, could anyone here review this short patch that i'd like to backport later today? https://gerrit.wikimedia.org/r/c/mediawiki/skins/Timeless/+/826633
[18:04:37] <wikibugs>	 (03PS1) 10Stang: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620)
[18:05:29] <icinga-wm>	 PROBLEM - Host ms-be2067.mgmt is DOWN: PING CRITICAL - Packet loss = 100%
[18:06:47] <icinga-wm>	 RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:11:46] <logmsgbot>	 !log dancy@deploy1002 install-world aborted:  (duration: 00m 02s)
[18:11:51] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.15.0" for 557 hosts
[18:12:21] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P33140 and previous config saved to /var/cache/conftool/dbconfig/20220825-181221-ladsgroup.json
[18:13:21] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.15.0" completed for 557 hosts
[18:18:44] <logmsgbot>	 !log bmansurov@deploy1002 Started deploy [airflow-dags/research@5712187]: (no justification provided)
[18:18:53] <logmsgbot>	 !log bmansurov@deploy1002 Finished deploy [airflow-dags/research@5712187]: (no justification provided) (duration: 00m 09s)
[18:19:26] <logmsgbot>	 !log dancy@deploy1002 rebuilt and synchronized wikiversions files: (no justification provided)
[18:20:40] <wikibugs>	 (03PS1) 10Stang: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620)
[18:22:32] <wikibugs>	 (CR) Dzahn: [C: +2] Add systemd timer to run scap stage-train on Tuesday morning (1 comment) [puppet] - https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: Ahmon Dancy)
[18:22:42] <wikibugs>	 (03PS1) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620)
[18:24:12] <wikibugs>	 (03Abandoned) 10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/822197 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[18:25:47] <wikibugs>	 (03PS2) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620)
[18:27:28] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P33141 and previous config saved to /var/cache/conftool/dbconfig/20220825-182727-ladsgroup.json
[18:27:37] <wikibugs>	 (03PS3) 10Stang: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620)
[18:31:35] <icinga-wm>	 RECOVERY - Host ms-be2067.mgmt is UP: PING OK - Packet loss = 0%, RTA = 34.19 ms
[18:33:33] <ottomata>	 !log rolling restart of eventgate-analytics-external to pick up retroactive schema change for android schemas in T316047
[18:33:36] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[18:33:37] <stashbot>	 T316047: Make provisions for geodata in all MEP schemas - https://phabricator.wikimedia.org/T316047
[18:33:45] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: sync
[18:34:07] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: sync
[18:34:18] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: sync
[18:35:01] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: sync
[18:35:33] <wikibugs>	 (CR) Dzahn: [C: +2] "change has been deployed. on deploy1002 the timer and service have been created but of course it's just waiting now for next Tuesday. optio" [puppet] - https://gerrit.wikimedia.org/r/819180 (https://phabricator.wikimedia.org/T310395) (owner: Ahmon Dancy)
[18:36:03] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: sync
[18:36:37] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: sync
[18:38:51] <wikibugs>	 (03PS1) 10Bking: Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609
[18:38:53] <wikibugs>	 (03PS1) 10Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826639 (https://phabricator.wikimedia.org/T308620)
[18:39:42] <wikibugs>	 (03CR) 10Gehel: [C: 03+1] Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609 (owner: 10Bking)
[18:40:06] <wikibugs>	 (03CR) 10Bking: [C: 03+2] Revert "Revert "elastic: enable ES7 repo on cloudelastic"" [puppet] - 10https://gerrit.wikimedia.org/r/826609 (owner: 10Bking)
[18:42:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T316186)', diff saved to https://phabricator.wikimedia.org/P33142 and previous config saved to /var/cache/conftool/dbconfig/20220825-184233-ladsgroup.json
[18:42:41] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[18:42:55] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[18:43:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33143 and previous config saved to /var/cache/conftool/dbconfig/20220825-184301-ladsgroup.json
[18:45:52] <logmsgbot>	 !log ebernhardson@deploy1002 Started deploy [wikimedia/discovery/analytics@d00af45]: bump elasticsearch-hadoop to 7.10.2
[18:47:40] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[18:47:44] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[18:48:00] <logmsgbot>	 !log ebernhardson@deploy1002 Finished deploy [wikimedia/discovery/analytics@d00af45]: bump elasticsearch-hadoop to 7.10.2 (duration: 02m 07s)
[18:48:22] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[18:49:11] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33144 and previous config saved to /var/cache/conftool/dbconfig/20220825-184911-ladsgroup.json
[18:54:07] <wikibugs>	 (03PS1) 10Bking: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676)
[18:54:56] <wikibugs>	 (03PS2) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[18:58:33] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops, Patch-For-Review: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049 (Papaul) Open→Resolved @Eevans thanks, the host is back online. The backplane replacement fixed the issue.
[18:58:39] <wikibugs>	 (CR) CI reject: [V: -1] elastic: fix string concatenation [cookbooks] - https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: Bking)
[19:03:20] <wikibugs>	 (03PS1) 10Urbanecm: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283)
[19:04:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33145 and previous config saved to /var/cache/conftool/dbconfig/20220825-190417-ladsgroup.json
[19:07:55] <wikibugs>	 (03PS3) 10Bking: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676)
[19:10:11] <wikibugs>	 (03PS4) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[19:19:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P33146 and previous config saved to /var/cache/conftool/dbconfig/20220825-191924-ladsgroup.json
[19:22:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Kanban): Replace cloudnet100[34] with cloudnet100[56] - https://phabricator.wikimedia.org/T316284 (10Andrew)
[19:24:30] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q4: (Need By: TBD) rack/setup/install 6 wmcs hosts - https://phabricator.wikimedia.org/T304888 (10Andrew)
[19:25:13] <icinga-wm>	 PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[19:25:34] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196)
[19:27:42] <wikibugs>	 (03PS1) 10Andrew Bogott: Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285)
[19:29:02] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudservices1003
[19:29:29] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693)
[19:29:39] <wikibugs>	 (03PS1) 10Ryan Kemper: elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676)
[19:31:01] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285) (owner: 10Andrew Bogott)
[19:31:10] <wikibugs>	 (03PS2) 10Andrew Bogott: Remove refs to cloudservices1003 [puppet] - 10https://gerrit.wikimedia.org/r/826645 (https://phabricator.wikimedia.org/T316285)
[19:31:49] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw: Degraded RAID on ms-be2035 - https://phabricator.wikimedia.org/T316194 (10Ladsgroup)
[19:32:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33147 and previous config saved to /var/cache/conftool/dbconfig/20220825-193238-ladsgroup.json
[19:32:43] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[19:33:37] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[19:33:48] <wikibugs>	 (03CR) 10Bking: [C: 03+2] elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper)
[19:34:24] <wikibugs>	 (03CR) 10Bking: [V: 03+2 C: 03+2] elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper)
[19:34:30] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T316186)', diff saved to https://phabricator.wikimedia.org/P33148 and previous config saved to /var/cache/conftool/dbconfig/20220825-193430-ladsgroup.json
[19:34:37] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:34:51] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[19:34:52] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:35:08] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[19:35:14] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33149 and previous config saved to /var/cache/conftool/dbconfig/20220825-193513-ladsgroup.json
[19:36:57] <urandom>	 !log rebooting ms-be2067 to "fix" disk enumeration(?) -- T314049
[19:37:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:37:01] <stashbot>	 T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049
[19:37:26] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:37:27] <logmsgbot>	 !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudservices1003
[19:37:46] <wikibugs>	 (03Merged) 10jenkins-bot: elastic: no need to run puppet during es 7 upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/826647 (https://phabricator.wikimedia.org/T308676) (owner: 10Ryan Kemper)
[19:41:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[19:41:06] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[19:41:29] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33150 and previous config saved to /var/cache/conftool/dbconfig/20220825-194129-ladsgroup.json
[19:42:07] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[19:45:39] <wikibugs>	 (03PS5) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[19:45:55] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) a:cmooney→Andrew @Andrew I indeed routed the subnet, which was already allocated to WMCS in codfw.  It seems I failed to update the description fo...
[19:47:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P33151 and previous config saved to /var/cache/conftool/dbconfig/20220825-194744-ladsgroup.json
[19:51:00] <wikibugs>	 (03PS6) 10Ryan Kemper: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[19:55:57] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[19:56:35] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33152 and previous config saved to /var/cache/conftool/dbconfig/20220825-195635-ladsgroup.json
[20:00:05] <jouncebot>	 brennen: How many deployers does it take to do UTC late backport and config training deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220825T2000).
[20:00:05] <jouncebot>	 jan_drewniak, koi, Urbanecm, and MatmaRex: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <urbanecm>	 o/
[20:00:25] <koi>	 o/
[20:00:28] <jan_drewniak>	 o/
[20:00:28] <wikibugs>	 (03Merged) 10jenkins-bot: elastic: fix string concatenation [cookbooks] - 10https://gerrit.wikimedia.org/r/826640 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[20:01:11] <MatmaRex>	 hi
[20:01:20] <thcipriani>	 howdy all
[20:01:39] <thcipriani>	 looks like a full window :D
[20:01:53] <urbanecm>	 thcipriani: yup! i'm happy to deploy if you want me to, or i can leave it to you.
[20:02:08] <jan_drewniak>	 last backport window of the week :P
[20:02:10] <MatmaRex>	 we can probably do all of the non-config patches in parallel
[20:02:11] <wikibugs>	 (03CR) 10Thcipriani: [C: 03+2] Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak)
[20:02:51] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115', diff saved to https://phabricator.wikimedia.org/P33153 and previous config saved to /var/cache/conftool/dbconfig/20220825-200250-ladsgroup.json
[20:02:57] <thcipriani>	 urbanecm: well. We've got no takers for backport training today. I'm happy to yield the deployment conch to you if you're up for it.
[20:03:08] <urbanecm>	 sure
[20:03:21] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219) (owner: 10Bartosz Dziewoński)
[20:03:34] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196) (owner: 10Bartosz Dziewoński)
[20:03:35] <thcipriani>	 <3
[20:05:49] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński)
[20:06:42] <wikibugs>	 (03Merged) 10jenkins-bot: Make DiscussionTools autotopicsub also opt-out on A/B test wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826646 (https://phabricator.wikimedia.org/T314693) (owner: 10Bartosz Dziewoński)
[20:07:01] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:04] <wikibugs>	 (03PS2) 10Urbanecm: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:09] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:15] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[20:07:16] <wikibugs>	 (03PS2) 10Urbanecm: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:19] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[20:07:29] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:33] <wikibugs>	 (03PS4) 10Urbanecm: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:36] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:07:50] <urbanecm>	 MatmaRex: your config patch is at mwdebug1001, can you have a look please?
[20:07:57] <wikibugs>	 (03Merged) 10jenkins-bot: Define default value for "wmgSiteLogoVariants" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826635 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:08:00] <MatmaRex>	 looking
[20:08:04] <wikibugs>	 (03Merged) 10jenkins-bot: Add language fallback support for wmgSiteLogoVariants [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826636 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:08:25] <wikibugs>	 (03Merged) 10jenkins-bot: zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/826637 (https://phabricator.wikimedia.org/T308620) (owner: 10Stang)
[20:08:44] <urbanecm>	 koi: fyi, i'm going to do the first three patches, the last one separately, as it changes other wikis (and depends on whether the first three are w/o issues).
[20:09:05] <koi>	 got it, thanks
[20:09:13] <MatmaRex>	 urbanecm: seems good
[20:09:16] <urbanecm>	 thanks, syncing
[20:11:17] <urbanecm>	 php-fpm restart has a progress indicator now, great.
[20:11:42] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P33154 and previous config saved to /var/cache/conftool/dbconfig/20220825-201141-ladsgroup.json
[20:11:54] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[20:12:13] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:13:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:13:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:13:24] <dancy>	 urbanecm: You're welcome. :-)
[20:13:52] <urbanecm>	 :)
[20:14:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:14:30] <urandom>	 !log re-rebooting ms-be2067 to "fix" disk enumeration(?) -- T314049
[20:14:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:14:34] <stashbot>	 T314049: Degraded RAID on ms-be2067 - https://phabricator.wikimedia.org/T314049
[20:15:51] <urbanecm>	 hmm. my connection to deploy host terminated, scap process is not running apparently, but lock was not released
[20:16:00] <urbanecm>	 can someone help please?
[20:16:31] <urbanecm>	 it _looks_ like i can just remove `/var/lock/scap.operations_mediawiki-config.lock` and re-sync, but I'd like confirmation before doing that.
[20:16:37] <dancy>	 yes, you can do that.
[20:16:37] <urbanecm>	 dancy: maybe you can help? :)
[20:16:42] <urbanecm>	 okay.
[20:16:47] <urbanecm>	 doing
[20:17:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Netbox: use  FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (10cmooney) Nice work!  Eventually all things considered it's probably best to control it from Netbox.  But I agree the existing mechanism works well i...
[20:17:21] <urbanecm>	 !log [urbanecm@deploy1002 ~]$ rm /var/lock/scap.operations_mediawiki-config.lock # connection to deploy1002 handled, to let me re-sync
[20:17:24] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:17:28] <urbanecm>	 and syncing again
[20:17:57] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2115 (T312160)', diff saved to https://phabricator.wikimedia.org/P33155 and previous config saved to /var/cache/conftool/dbconfig/20220825-201756-ladsgroup.json
[20:17:59] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[20:18:01] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[20:18:12] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2101.codfw.wmnet with reason: Maintenance
[20:18:46] <wikibugs>	 (03PS3) 10AOkoth: gitlab: copy ssh host keys for failover [puppet] - 10https://gerrit.wikimedia.org/r/820163 (https://phabricator.wikimedia.org/T296713)
[20:18:48] <wikibugs>	 (03PS1) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650
[20:19:05] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:19:28] <wikibugs>	 (03PS2) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650
[20:21:46] <wikibugs>	 (03PS1) 10Bking: elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676)
[20:22:19] <wikibugs>	 (03Merged) 10jenkins-bot: Add clearfix to .mw-body-subheader [skins/Vector] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826608 (https://phabricator.wikimedia.org/T316134) (owner: 10Jdrewniak)
[20:22:21] <wikibugs>	 (03Merged) 10jenkins-bot: Update VE core submodule to master (d4c438548) [extensions/VisualEditor] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826345 (https://phabricator.wikimedia.org/T316219) (owner: 10Bartosz Dziewoński)
[20:22:29] <wikibugs>	 (03Merged) 10jenkins-bot: Hide new 'associatedPages' navigation items [skins/Timeless] (wmf/1.39.0-wmf.26) - 10https://gerrit.wikimedia.org/r/826610 (https://phabricator.wikimedia.org/T316196) (owner: 10Bartosz Dziewoński)
[20:22:51] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+1] elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[20:23:28] <wikibugs>	 (03CR) 10Ryan Kemper: [C: 03+2] elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[20:23:44] <wikibugs>	 (03CR) 10Dzahn: "the 1 line for envoy needs to move to ./hosts/ but everything else should stay in common.yaml" [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth)
[20:23:56] <wikibugs>	 (03PS3) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650
[20:24:10] <wikibugs>	 (03PS4) 10AOkoth: vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650
[20:24:46] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f37eff3f1607c898120c4f151b0af0d4b6bfdd19: Make DiscussionTools autotopicsub also opt-out on A/B test wikis (T314693) (duration: 03m 37s)
[20:24:49] <urbanecm>	 finally
[20:24:51] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "yep, testing, fake values, lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth)
[20:24:51] <stashbot>	 T314693: [Config Change] Make Topic Subscriptions available by default at A/B test wikis (desktop) - https://phabricator.wikimedia.org/T314693
[20:25:12] <wikibugs>	 (03CR) 10AOkoth: [C: 03+2] vrts: add cloud hieradata for devtools project [puppet] - 10https://gerrit.wikimedia.org/r/826650 (owner: 10AOkoth)
[20:26:02] <urbanecm>	 MatmaRex: jan_drewniak: your backports are at mwdebug1001, please test
[20:26:13] <urbanecm>	 koi: your first three config patches are at mwdebug1001 too, please test
[20:26:26] <koi>	 looking
[20:26:32] <MatmaRex>	 thanks
[20:26:38] <wikibugs>	 (03Merged) 10jenkins-bot: elastic: use correct systemd command [cookbooks] - 10https://gerrit.wikimedia.org/r/826651 (https://phabricator.wikimedia.org/T308676) (owner: 10Bking)
[20:26:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T316186)', diff saved to https://phabricator.wikimedia.org/P33156 and previous config saved to /var/cache/conftool/dbconfig/20220825-202647-ladsgroup.json
[20:26:51] <jan_drewniak>	 urbanecm: mine looks good
[20:26:56] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:26:59] <urbanecm>	 thanks, syncing
[20:27:10] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[20:27:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33157 and previous config saved to /var/cache/conftool/dbconfig/20220825-202716-ladsgroup.json
[20:27:43] <MatmaRex>	 urbanecm: both look good
[20:28:19] <urbanecm>	 thanks, will sync too
[20:29:11] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:29:49] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:30:38] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:30:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:31:40] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Vector/resources/skins.vector.styles/layouts/screen.less: fe3382ea74a7ca5c8954ed456f4cd100208ed1e6: Add clearfix to .mw-body-subheader (T316134, T316095) (duration: 03m 25s)
[20:31:45] <stashbot>	 T316134: Page indicators are in line with content - https://phabricator.wikimedia.org/T316134
[20:31:46] <stashbot>	 T316095: PAGEBANNER is not displaying at euwiki with New Vector - https://phabricator.wikimedia.org/T316095
[20:32:07] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:32:10] <urbanecm>	 jan_drewniak: your patch is live
[20:32:12] <koi>	 urbanecm: unfortunately it does not work
[20:32:21] <urbanecm>	 okay, so i'll revert (and skip the fourth?)
[20:32:36] <jan_drewniak>	 urbanecm: as always, thanks! 
[20:32:42] <urbanecm>	 happy to help!
[20:33:07] <wikibugs>	 (PS1) Urbanecm: Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620)
[20:33:15] <wikibugs>	 (CR) Urbanecm: [C: +2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620) (owner: Urbanecm)
[20:33:20] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "zhwiki: Use wmgSiteLogoVariantFallback to reduce duplicated code" [mediawiki-config] - https://gerrit.wikimedia.org/r/826611 (https://phabricator.wikimedia.org/T308620) (owner: Urbanecm)
[20:33:31] <koi>	 I thought, is it ok to only revert the third one? I would like to figure out what to do later, and the previous two have no effect
[20:33:39] <wikibugs>	 (PS1) Urbanecm: Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/826612 (https://phabricator.wikimedia.org/T308620)
[20:33:53] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[20:33:58] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[20:34:48] <urbanecm>	 koi: what's the nature of "it does not work" please? if the bug is in the code you added to CS.php, wouldn't we need to rewrite it anyway (so revert is ok)?
[20:35:10] <urbanecm>	 I'm not really a fan of having variables in IS.php that are knowingly-broken
[20:35:43] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/skins/Timeless/: ba0e981890aa6eb61598e4df786f7122e17b3002: Hide new associatedPages navigation items (T316196) (duration: 03m 41s)
[20:35:47] <stashbot>	 T316196: Timeless’ namespace tabs are duplicated - https://phabricator.wikimedia.org/T316196
[20:37:10] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:38:11] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:38:11] <koi>	 these three patches should make everything look the same as before them, but now the wrong logo is shown for some variants (cn/my/sg)
[20:38:39] <koi>	 I'm fine with reverting them all, and nvm about the reason I gave before (keeping the broken thing inside CS.php)
[20:38:39] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:38:40] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:39:19] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:39:25] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.26/extensions/VisualEditor/: 223e81f08e1f62b1ed78bcb2bdcc104e7fb60734: Update VE core submodule to master (d4c438548; T316219) (duration: 03m 42s)
[20:39:30] <stashbot>	 T316219: Mention autocompletion doesn't work as expected with the reply tool - https://phabricator.wikimedia.org/T316219
[20:39:54] <urbanecm>	 okay, i'll revert them all in that case
[20:39:57] <urbanecm>	 MatmaRex: your patches are live now
[20:40:02] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:40:06] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "Add language fallback support for wmgSiteLogoVariants" [mediawiki-config] - https://gerrit.wikimedia.org/r/826612 (https://phabricator.wikimedia.org/T308620) (owner: Urbanecm)
[20:40:08] <MatmaRex>	 thanks
[20:40:19] <wikibugs>	 (PS1) Urbanecm: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620)
[20:40:26] <wikibugs>	 (PS2) Urbanecm: Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620)
[20:40:29] <wikibugs>	 (CR) Urbanecm: [V: +2 C: +2] Revert "Define default value for "wmgSiteLogoVariants"" [mediawiki-config] - https://gerrit.wikimedia.org/r/826613 (https://phabricator.wikimedia.org/T308620) (owner: Urbanecm)
[20:40:50] <wikibugs>	 (PS2) Urbanecm: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283)
[20:40:53] <wikibugs>	 (CR) Urbanecm: [C: +2] cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) (owner: Urbanecm)
[20:41:40] <wikibugs>	 (Merged) jenkins-bot: cswiki: Add extendedconfirmed group/protection level [mediawiki-config] - https://gerrit.wikimedia.org/r/826641 (https://phabricator.wikimedia.org/T316283) (owner: Urbanecm)
[20:42:37] <urbanecm>	 patch works, syncing
[20:42:44] <wikibugs>	 (CR) Andrea Denisse: doc: Fix smalll typos in the systemd::sysuser documentation. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/826490 (owner: Andrea Denisse)
[20:42:48] <wikibugs>	 (CR) Andrea Denisse: [C: +2] doc: Fix smalll typos in the systemd::sysuser documentation. [puppet] - https://gerrit.wikimedia.org/r/826490 (owner: Andrea Denisse)
[20:42:59] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:45:05] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:45:47] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.remove-downtime for ms-be2067.codfw.wmnet
[20:45:47] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be2067.codfw.wmnet
[20:46:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:46:26] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:46:48] <logmsgbot>	 !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1aafdf0bd1d33929f2dd75ef4da9772d8832a31c: cswiki: Add extendedconfirmed group/protection level (T316283) (duration: 03m 42s)
[20:46:52] <stashbot>	 T316283: Create `extendedconfirmed` at cswiki and make it possible to protect pages on that level - https://phabricator.wikimedia.org/T316283
[20:46:54] <urbanecm>	 and, looks like we're done
[20:47:07] <urbanecm>	 !log UTC late B&C window done
[20:47:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:47:17] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:47:47] <icinga-wm>	 PROBLEM - High average GET latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:48:33] <wikibugs>	 ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Andrew) a:Andrew→Cmjohnson
[20:49:06] <wikibugs>	 ops-eqiad, decommission-hardware, cloud-services-team (Kanban): decommission cloudservices1003.wikimedia..org - https://phabricator.wikimedia.org/T316285 (Andrew) @cmjohnson, this is another host that will need its drives wiped, as the cookbook seems to be bad at that lately.  Thanks!
[20:51:43] <wikibugs>	 SRE-tools, Infrastructure-Foundations, cloud-services-team (Kanban): decom cookbook often fails to wipe drives in HP systems - https://phabricator.wikimedia.org/T316292 (Andrew)
[20:52:21] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply
[20:53:23] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply
[20:53:24] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply
[20:53:43] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:54:25] <logmsgbot>	 !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply
[20:56:33] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[20:56:38] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[20:59:43] <icinga-wm>	 RECOVERY - High average GET latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=GET
[20:59:54] <wikibugs>	 SRE, Infrastructure-Foundations, netbox, netops, Patch-For-Review: Netbox: use FHRP Groups feature - https://phabricator.wikimedia.org/T311218 (Reedy)
[21:01:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33158 and previous config saved to /var/cache/conftool/dbconfig/20220825-210130-ladsgroup.json
[21:02:04] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[21:02:08] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[21:02:20] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (Andrew) The label should just be 'public floating IPs for cloud-vps codfw1dev' -- by their very nature the actual use of any particular IP will shift over time bas...
[21:04:09] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:09:27] <icinga-wm>	 PROBLEM - Check systemd state on snapshot1008 is CRITICAL: CRITICAL - degraded: The following units failed: cirrussearch-dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:12:03] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic elasticsearch and plugin upgrade - bking@cumin2002 - T316159
[21:12:09] <stashbot>	 T316159: Upgrade cloudelastic cluster to 7.10.2 - https://phabricator.wikimedia.org/T316159
[21:16:37] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33159 and previous config saved to /var/cache/conftool/dbconfig/20220825-211637-ladsgroup.json
[21:29:13] <jinxer-wm>	 (KubernetesRsyslogDown) firing: (8) rsyslog on dse-k8s-worker1001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[21:31:43] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P33160 and previous config saved to /var/cache/conftool/dbconfig/20220825-213143-ladsgroup.json
[21:35:23] <icinga-wm>	 RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:39:12] <wikibugs>	 SRE: Sunset search.wikimedia.org service - https://phabricator.wikimedia.org/T316296 (mpopov)
[21:46:50] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33161 and previous config saved to /var/cache/conftool/dbconfig/20220825-214649-ladsgroup.json
[21:47:03] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[21:47:16] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2180.codfw.wmnet with reason: Maintenance
[21:47:22] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33162 and previous config saved to /var/cache/conftool/dbconfig/20220825-214722-ladsgroup.json
[21:52:48] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33163 and previous config saved to /var/cache/conftool/dbconfig/20220825-215247-ladsgroup.json
[22:07:54] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P33164 and previous config saved to /var/cache/conftool/dbconfig/20220825-220753-ladsgroup.json
[22:09:18] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db2131.codfw.wmnet with reason: Maintenance
[22:09:31] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2131.codfw.wmnet with reason: Maintenance
[22:09:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33165 and previous config saved to /var/cache/conftool/dbconfig/20220825-220937-ladsgroup.json
[22:09:42] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160
[22:22:33] <wikibugs>	 SRE, Cloud-Services, Infrastructure-Foundations, netops: Undocumented IP on WMCS network - https://phabricator.wikimedia.org/T315955 (cmooney) Thanks Andrew, I've updated the description for the codfw range now.  In terms of DNS I don't seem to get any PTR records back for the ranges in codfw:  `...
[22:23:00] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180', diff saved to https://phabricator.wikimedia.org/P33167 and previous config saved to /var/cache/conftool/dbconfig/20220825-222259-ladsgroup.json
[22:30:38] <wikibugs>	 (PS1) Dduvall: docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501)
[22:32:25] <wikibugs>	 (CR) Dduvall: "From the commit msg:" [puppet] - https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[22:34:47] <icinga-wm>	 PROBLEM - DNS on cloudservices1003.mgmt is CRITICAL: Domain cloudservices1003.mgmt.eqiad.wmnet was not found by the server https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[22:38:06] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2180 (T316186)', diff saved to https://phabricator.wikimedia.org/P33168 and previous config saved to /var/cache/conftool/dbconfig/20220825-223805-ladsgroup.json
[22:38:40] <wikibugs>	 (CR) Ebernhardson: [C: +1] "confirm these hosts are all decom'd" [puppet] - https://gerrit.wikimedia.org/r/826630 (https://phabricator.wikimedia.org/T316240) (owner: Bking)
[22:48:03] <wikibugs>	 (CR) Ahmon Dancy: [C: +1] docker_registry_ha: Authorize GitLab trusted runners using JWT [puppet] - https://gerrit.wikimedia.org/r/826674 (https://phabricator.wikimedia.org/T308501) (owner: Dduvall)
[23:13:04] <wikibugs>	 (PS1) Stang: bewikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/826677 (https://phabricator.wikimedia.org/T310961)
[23:16:28] <wikibugs>	 (PS1) Stang: euwikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/826678 (https://phabricator.wikimedia.org/T310961)
[23:18:35] <wikibugs>	 (PS1) Stang: cswikisource: Adjust width-height ratio of logo to fix display issue [mediawiki-config] - https://gerrit.wikimedia.org/r/826679 (https://phabricator.wikimedia.org/T310961)
[23:20:14] <wikibugs>	 (Abandoned) Stang: Add wmgSiteLogoVariants support for Chinese Wikimedia projects [mediawiki-config] - https://gerrit.wikimedia.org/r/826639 (https://phabricator.wikimedia.org/T308620) (owner: Stang)
[23:20:45] <wikibugs>	 (PS1) Zabe: Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150)
[23:22:44] <wikibugs>	 (CR) CI reject: [V: -1] Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150) (owner: Zabe)
[23:23:30] <wikibugs>	 (PS1) Zabe: phan: Fix use of IMaintainableDatabase::tableExists [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826615
[23:23:39] <wikibugs>	 (PS2) Zabe: Use real transactions when managing the voter list for a poll [extensions/SecurePoll] (wmf/1.39.0-wmf.26) - https://gerrit.wikimedia.org/r/826614 (https://phabricator.wikimedia.org/T316150)
[23:30:39] <icinga-wm>	 RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook
[23:53:01] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2131 (T312160)', diff saved to https://phabricator.wikimedia.org/P33169 and previous config saved to /var/cache/conftool/dbconfig/20220825-235300-ladsgroup.json
[23:53:07] <stashbot>	 T312160: Adjust the field type of cx_corpora.cxc_timestamp to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312160