[00:04:20] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.067 second response time https://wikitech.wikimedia.org/wiki/Swift [00:06:15] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) p:05Triage→03Unbreak! Logstash search for SwiftFileBackend {F35317523} [00:06:42] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.036 second response time https://wikitech.wikimedia.org/wiki/Swift [00:11:00] RECOVERY - Swift https frontend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Swift [00:13:50] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:19:08] PROBLEM - Swift https backend on ms-fe1010 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.094 second response time https://wikitech.wikimedia.org/wiki/Swift [00:26:24] RECOVERY - Swift https backend on ms-fe1010 is OK: HTTP OK: HTTP/1.1 200 OK - 451 bytes in 0.022 second response time https://wikitech.wikimedia.org/wiki/Swift [00:27:18] PROBLEM - Check systemd state on webperf1004 is CRITICAL: CRITICAL - degraded: The following units failed: arclamp_compress_logs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:27:33] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) You can see it in the nginx log sizes on ms-fe1010: ` -rw-r----- 1 www-data www-data 6904870 Jul 15 00:24 unified.error.log -rw-r----- 1 www-data www-data 667552422 Jul 15 00:00 unified.error.log.1 -rw-r-----... 
[00:30:49] !log on ms-fe1010 restarting swift-proxy [00:30:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:39:06] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) I restarted swift-proxy on ms-fe1010, which I think has fixed it. Here's how I realised it was a problem specific to ms-fe1010: {F35317550} [00:43:24] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10tstarling) p:05Unbreak!→03Medium Logstash, CPU usage and nginx logs all show recovery. I will leave it open at reduced priority until the relevant SRE folks see it, for post mortem analysis and followup. [01:07:48] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:18:44] RECOVERY - Check systemd state on webperf1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:37:45] (JobUnavailable) firing: Reduced availability for job workhorse in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:42:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:45] (JobUnavailable) firing: (4) Reduced availability for job nginx in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:45] (JobUnavailable) resolved: (4) Reduced availability for job nginx in ops@eqiad 
- https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:17:43] (03CR) 10RLazarus: [C: 03+1] fix flask/jinja2 semver snafu [software/klaxon] - 10https://gerrit.wikimedia.org/r/813938 (owner: 10CDanis) [03:18:08] (03CR) 10RLazarus: [C: 03+1] restore styling accidentally removed in 16f1d6c [software/klaxon] - 10https://gerrit.wikimedia.org/r/813939 (owner: 10CDanis) [03:22:14] (03CR) 10RLazarus: Don't hardcode v1 of the api in the base path (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 (owner: 10CDanis) [03:23:38] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:25:52] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.067 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [03:26:46] (03CR) 10RLazarus: [C: 03+1] Don't hardcode v1 of the api in the base path (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 (owner: 10CDanis) [03:43:19] (03CR) 10RLazarus: [C: 03+1] Add support for fetching current oncallers (033 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 (owner: 10CDanis) [03:59:54] (03CR) 10RLazarus: [C: 03+1] display current oncallers in Klaxon UI (032 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 (owner: 10CDanis) [04:29:46] PROBLEM - Check systemd state on contint2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-system-prune-dangling.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:02:29] (03PS1) 10Marostegui: Revert "db1135,dbproxy1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/813960 [05:04:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling 
@ 1%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31114 and previous config saved to /var/cache/conftool/dbconfig/20220715-050400-root.json [05:04:01] (03CR) 10Marostegui: [C: 03+2] Revert "db1135,dbproxy1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/813960 (owner: 10Marostegui) [05:09:26] (03PS2) 10Krinkle: xenon: Switch to systemd::sysuser [puppet] - 10https://gerrit.wikimedia.org/r/804546 (owner: 10Muehlenhoff) [05:10:11] (03CR) 10Krinkle: [C: 03+1] "Perhaps we should rename this at some point, to match the current service and directory naming." [puppet] - 10https://gerrit.wikimedia.org/r/804546 (owner: 10Muehlenhoff) [05:19:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 2%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31115 and previous config saved to /var/cache/conftool/dbconfig/20220715-051904-root.json [05:20:55] (03PS1) 10Krinkle: WIP: Testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 [05:20:57] (03PS1) 10Krinkle: [DNM] Verify buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814012 [05:21:47] (03PS2) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:21:49] (03PS2) 10Krinkle: [DNM] Verify buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814012 [05:22:48] (03PS3) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:22:50] (03PS3) 10Krinkle: [DNM] Verify buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814012 [05:25:11] (03PS4) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:25:13] (03PS4) 10Krinkle: [DNM] Verify 
buildConfigCache.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814012 [05:32:45] (03PS5) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:33:20] (03CR) 10CI reject: [V: 04-1] tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [05:34:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 5%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31116 and previous config saved to /var/cache/conftool/dbconfig/20220715-053408-root.json [05:35:14] (03PS6) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:35:42] (03PS7) 10Krinkle: tests: Move buildConfigCache.php to tests/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) [05:49:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 10%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31117 and previous config saved to /var/cache/conftool/dbconfig/20220715-054912-root.json [06:04:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 25%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31118 and previous config saved to /var/cache/conftool/dbconfig/20220715-060416-root.json [06:08:37] !log T311939 Updated list of masters for psi-codfw search to `elastic2027.codfw.wmnet:9700,elastic2029.codfw.wmnet:9700,elastic2054.codfw.wmnet:9700` [06:08:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:08:40] T311939: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 [06:11:24] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch, 10Patch-For-Review: Degraded 
RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10RKemper) Following method in https://phabricator.wikimedia.org/T294805#7701855, set the new codfw psi seeds: With: ` ryankemper@mwmaint1002:~/elastic$... [06:15:00] Thanks ryankemper [06:17:08] 10SRE-swift-storage: Spike in Swift errors - https://phabricator.wikimedia.org/T313102 (10RhinosF1) ms-fe1010 was flapping saying its frontend was critical yesterday. [06:19:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 50%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31119 and previous config saved to /var/cache/conftool/dbconfig/20220715-061920-root.json [06:30:51] (03PS1) 10KartikMistry: Enable Content and Section translation on WPs with NLLB-200 MT support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814015 (https://phabricator.wikimedia.org/T309384) [06:31:27] (03CR) 10CI reject: [V: 04-1] Enable Content and Section translation on WPs with NLLB-200 MT support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814015 (https://phabricator.wikimedia.org/T309384) (owner: 10KartikMistry) [06:34:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1135 (re)pooling @ 75%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31120 and previous config saved to /var/cache/conftool/dbconfig/20220715-063424-root.json [06:34:53] (03PS2) 10KartikMistry: Enable Content and Section translation on WPs with NLLB-200 MT support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814015 (https://phabricator.wikimedia.org/T309384) [06:35:52] (03PS3) 10Marostegui: core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 [06:48:33] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:49:28] !log marostegui@cumin1001 dbctl commit 
(dc=all): 'db1135 (re)pooling @ 100%: After maintenance', diff saved to https://phabricator.wikimedia.org/P31121 and previous config saved to /var/cache/conftool/dbconfig/20220715-064928-root.json [06:53:54] (03PS1) 10Marostegui: db2084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/814094 (https://phabricator.wikimedia.org/T311493) [06:54:59] (03CR) 10Marostegui: [C: 03+2] db2084: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/814094 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [06:57:07] (03PS1) 10Marostegui: mariadb: Productionize db2166 [puppet] - 10https://gerrit.wikimedia.org/r/814095 (https://phabricator.wikimedia.org/T311493) [06:58:19] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2166 [puppet] - 10https://gerrit.wikimedia.org/r/814095 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220715T0700) [07:10:15] (03PS1) 10Marostegui: site.pp: Remove db2166 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/814096 (https://phabricator.wikimedia.org/T311493) [07:11:28] (03CR) 10Marostegui: [C: 03+2] site.pp: Remove db2166 from insetup [puppet] - 10https://gerrit.wikimedia.org/r/814096 (https://phabricator.wikimedia.org/T311493) (owner: 10Marostegui) [07:16:11] (03PS1) 10Marostegui: change_change_time_T313070.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814097 (https://phabricator.wikimedia.org/T313070) [07:26:03] !log update thirdparty/node14 to Node 14.20.0 [07:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:22] !log update thirdparty/node16 to Node 16.16.0 [07:26:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:25] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP wmf group for Aline Bruenger 
WMDE - https://phabricator.wikimedia.org/T312220 (10karapayneWMDE) >>! In T312220#8066406, @jhathaway wrote: > @karapayneWMDE do you happen to know? Apologies for the delay, Aline is indeed a new WMDE employee. They're not in my... [07:42:31] (03CR) 10David Caro: [C: 03+1] "LGTM, just fix the errors in jenkins (see message below), feel free to ignore the nits." [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/812916 (owner: 10Nskaggs) [07:54:22] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html-offline-resources/{title} (Get offline resource links to accompany page content HTML for test page) is CRITICAL: Test Get offline resource links to accompany page content HTML for test page returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [07:56:32] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [08:00:26] PROBLEM - SSH on wtp1040.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:02:06] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:04:10] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [08:12:50] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10ayounsi) 05Resolved→03Open https://netbox.wikimedia.org/dcim/devices/2612/ and https://netbox.wikimedia.org/dcim/devices/2252/ still show up as being in rack `D5` but cabled to a differen... 
[08:20:11] 10SRE, 10ops-eqiad, 10DBA: eqiad: move non WMCS servers out of rack D5 - https://phabricator.wikimedia.org/T308331 (10ayounsi) [09:03:30] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:07:39] (03PS1) 10Cparle: Update config for commons custommatch search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814108 [09:13:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:13:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1145.eqiad.wmnet with reason: Maintenance [09:15:25] (03CR) 10Ladsgroup: [C: 03+1] change_change_time_T313070.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814097 (https://phabricator.wikimedia.org/T313070) (owner: 10Marostegui) [09:15:40] PROBLEM - Check systemd state on logstash1026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:22:29] (03CR) 10Ladsgroup: [C: 03+1] "It works https://integration.wikimedia.org/ci/job/operations-mw-config-php72-composer-diffConfig-docker/12389/console" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814011 (https://phabricator.wikimedia.org/T169821) (owner: 10Krinkle) [09:24:38] RECOVERY - ElasticSearch setting check - 9400 on elastic2047 is OK: OK - All good! 
https://wikitech.wikimedia.org/wiki/Search%23Administration [09:28:08] (03CR) 10Marostegui: [C: 03+2] change_change_time_T313070.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814097 (https://phabricator.wikimedia.org/T313070) (owner: 10Marostegui) [09:28:35] (03Merged) 10jenkins-bot: change_change_time_T313070.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/814097 (https://phabricator.wikimedia.org/T313070) (owner: 10Marostegui) [09:30:41] (03PS1) 10Cparle: Make weighted_tags search default for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814111 [09:34:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:34:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1123.eqiad.wmnet with reason: Maintenance [09:34:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T312984)', diff saved to https://phabricator.wikimedia.org/P31123 and previous config saved to /var/cache/conftool/dbconfig/20220715-093449-ladsgroup.json [09:34:53] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [09:37:30] (03CR) 10Matthias Mullie: [C: 03+1] Make weighted_tags search default for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814111 (owner: 10Cparle) [09:37:33] (03CR) 10Matthias Mullie: [C: 03+1] Update config for commons custommatch search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814108 (owner: 10Cparle) [09:38:12] !log killed refreshLinkRecommendations.php in testwiki (T299021) [09:38:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:15] T299021: Shorten running time of refreshLinkRecommendations.php - https://phabricator.wikimedia.org/T299021 [09:49:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Repooling after maintenance db1123 (T312984)', diff saved to https://phabricator.wikimedia.org/P31124 and previous config saved to /var/cache/conftool/dbconfig/20220715-094958-ladsgroup.json [09:50:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:03:20] RECOVERY - SSH on wtp1040.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:05:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31125 and previous config saved to /var/cache/conftool/dbconfig/20220715-100503-ladsgroup.json [10:06:24] (03CR) 10Ladsgroup: [C: 03+2] fix flask/jinja2 semver snafu (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813938 (owner: 10CDanis) [10:08:26] (03Merged) 10jenkins-bot: fix flask/jinja2 semver snafu [software/klaxon] - 10https://gerrit.wikimedia.org/r/813938 (owner: 10CDanis) [10:08:28] (03Merged) 10jenkins-bot: Use ProxyFix middleware to correctly recognize HTTPS usage [software/klaxon] - 10https://gerrit.wikimedia.org/r/794759 (https://phabricator.wikimedia.org/T308941) (owner: 10Legoktm) [10:09:52] (03CR) 10Ladsgroup: [C: 03+2] restore styling accidentally removed in 16f1d6c [software/klaxon] - 10https://gerrit.wikimedia.org/r/813939 (owner: 10CDanis) [10:12:21] (03Merged) 10jenkins-bot: restore styling accidentally removed in 16f1d6c [software/klaxon] - 10https://gerrit.wikimedia.org/r/813939 (owner: 10CDanis) [10:15:26] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Joe) p:05Triage→03Medium [10:20:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P31126 and previous config saved to 
/var/cache/conftool/dbconfig/20220715-102008-ladsgroup.json [10:35:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T312984)', diff saved to https://phabricator.wikimedia.org/P31127 and previous config saved to /var/cache/conftool/dbconfig/20220715-103513-ladsgroup.json [10:35:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:35:19] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [10:35:28] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [10:41:53] (03PS1) 10Giuseppe Lavagetto: admin: add bgwiki to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/814121 (https://phabricator.wikimedia.org/T312827) [10:43:11] (03PS2) 10Giuseppe Lavagetto: admin: add bgwiki to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/814121 (https://phabricator.wikimedia.org/T312827) [10:43:54] (03PS3) 10Giuseppe Lavagetto: admin: add bgwiki to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/814121 (https://phabricator.wikimedia.org/T312827) [10:45:36] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add bgwiki to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/814121 (https://phabricator.wikimedia.org/T312827) (owner: 10Giuseppe Lavagetto) [10:46:56] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Joe) [10:56:16] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Joe) Hi @Bethany in about 30 minutes you should be able to access all systems and to ssh to the 
hadoop nodes, and change your kerber... [10:56:16] !log hashar@deploy1002 Started deploy [integration/docroot@e563641]: Add banan-i18n library [10:56:25] !log hashar@deploy1002 Finished deploy [integration/docroot@e563641]: Add banan-i18n library (duration: 00m 08s) [10:56:40] (03PS1) 10Giuseppe Lavagetto: admin: add kerberos to bgwiki [puppet] - 10https://gerrit.wikimedia.org/r/814122 (https://phabricator.wikimedia.org/T312827) [10:57:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:57:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1166.eqiad.wmnet with reason: Maintenance [10:57:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T312984)', diff saved to https://phabricator.wikimedia.org/P31128 and previous config saved to /var/cache/conftool/dbconfig/20220715-105748-ladsgroup.json [10:57:52] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:15:26] PROBLEM - Ensure hosts are not performing a change on every puppet run on cumin1001 is CRITICAL: CRITICAL: the following (25) node(s) change every puppet run: aqs2001, aqs2002, aqs2003, aqs2004, aqs2005, aqs2006, aqs2007, aqs2008, aqs2009, aqs2010, aqs2011, aqs2012, cloudservices1003, cloudservices1004, ms-fe1010, ms-fe1011, ms-fe1012, ms-fe2010, ms-fe2011, ms-fe2012, thanos-fe1002, thanos-fe1003, thanos-fe2001, thanos-fe2002, thanos-fe20 [11:15:26] ://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [11:21:40] (03CR) 10Muehlenhoff: [C: 03+1] admin: add kerberos to bgwiki [puppet] - 10https://gerrit.wikimedia.org/r/814122 (https://phabricator.wikimedia.org/T312827) (owner: 10Giuseppe Lavagetto) [11:21:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312984)', diff saved to https://phabricator.wikimedia.org/P31129 and previous 
config saved to /var/cache/conftool/dbconfig/20220715-112157-ladsgroup.json [11:22:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [11:37:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31130 and previous config saved to /var/cache/conftool/dbconfig/20220715-113702-ladsgroup.json [11:52:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P31131 and previous config saved to /var/cache/conftool/dbconfig/20220715-115207-ladsgroup.json [12:07:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T312984)', diff saved to https://phabricator.wikimedia.org/P31132 and previous config saved to /var/cache/conftool/dbconfig/20220715-120713-ladsgroup.json [12:07:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:07:18] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [12:07:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1112.eqiad.wmnet with reason: Maintenance [12:07:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:07:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [12:07:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T312984)', diff saved to https://phabricator.wikimedia.org/P31133 and previous config saved to /var/cache/conftool/dbconfig/20220715-120750-ladsgroup.json [12:09:17] 10SRE, 
10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10hashar) [12:10:04] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [12:13:53] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10hashar) `pcov` got build and uploaded to `component/php74`.... [12:14:06] 10SRE, 10serviceops, 10Continuous-Integration-Config, 10Release-Engineering-Team (CI & Testing services), 10Test-Coverage: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10hashar) a:05Legoktm→03None [12:21:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312984)', diff saved to https://phabricator.wikimedia.org/P31134 and previous config saved to /var/cache/conftool/dbconfig/20220715-122119-ladsgroup.json [12:21:23] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [12:36:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31135 and previous config saved to /var/cache/conftool/dbconfig/20220715-123624-ladsgroup.json [12:44:01] 10SRE, 10API Platform, 10Traffic, 10VisualEditor, and 2 others: Find out if Varnish is messing with ETags, and what to do about it. 
- https://phabricator.wikimedia.org/T310904 (10daniel) 05Open→03Resolved [12:46:31] (03CR) 10Ladsgroup: "I'm taking over mwhahahaha" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [12:47:18] (03CR) 10Marostegui: core.pp: Make sync_binlog and trx_commit configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [12:50:48] (03PS1) 10Hashar: ci: enable docker on machine start [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) [12:51:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P31136 and previous config saved to /var/cache/conftool/dbconfig/20220715-125129-ladsgroup.json [12:53:54] (03CR) 10Hashar: "See T313119#8080674 for the details." [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [12:59:39] (03PS1) 10David Caro: novafullstack: fix timing issue [puppet] - 10https://gerrit.wikimedia.org/r/814162 [13:01:38] (03CR) 10David Caro: [C: 03+2] novafullstack: fix timing issue [puppet] - 10https://gerrit.wikimedia.org/r/814162 (owner: 10David Caro) [13:02:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ayounsi) From diffscan, those two hosts have their SSH port exposed to the world: ` New Open Service List --------------------- STATUS HOST POR... [13:05:21] (03CR) 10Slyngshede: [C: 03+1] "Looks good, minor detail in comment." 
[puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [13:05:21] !log bking@cumin1001 START - Cookbook sre.elasticsearch.force-shard-allocation [13:05:24] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [13:06:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T312984)', diff saved to https://phabricator.wikimedia.org/P31137 and previous config saved to /var/cache/conftool/dbconfig/20220715-130634-ladsgroup.json [13:06:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:06:39] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:07:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1175.eqiad.wmnet with reason: Maintenance [13:07:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T312984)', diff saved to https://phabricator.wikimedia.org/P31138 and previous config saved to /var/cache/conftool/dbconfig/20220715-130706-ladsgroup.json [13:07:12] (03CR) 10Majavah: "is there any reason not to go with just" [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [13:14:52] RECOVERY - Check systemd state on mw2392 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:52] (03CR) 10Hashar: ci: enable docker on machine start (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [13:19:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312984)', diff saved to https://phabricator.wikimedia.org/P31139 and previous config saved to 
/var/cache/conftool/dbconfig/20220715-131916-ladsgroup.json [13:19:20] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [13:32:14] (03PS2) 10Hashar: ci: enable docker on machine start [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) [13:33:35] (03CR) 10Hashar: ci: enable docker on machine start (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/814157 (https://phabricator.wikimedia.org/T313119) (owner: 10Hashar) [13:34:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31140 and previous config saved to /var/cache/conftool/dbconfig/20220715-133421-ladsgroup.json [13:42:51] (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: add kerberos to bgwiki [puppet] - 10https://gerrit.wikimedia.org/r/814122 (https://phabricator.wikimedia.org/T312827) (owner: 10Giuseppe Lavagetto) [13:43:45] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/814170 [13:45:35] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for bgwiki / Bethany Gerdemann - https://phabricator.wikimedia.org/T312827 (10Joe) 05Open→03Resolved a:03Joe Tentatively resolving. Please let us know if you have issues by re-opening the task. 
[13:49:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P31141 and previous config saved to /var/cache/conftool/dbconfig/20220715-134926-ladsgroup.json [13:51:18] 10SRE, 10SRE-Access-Requests: Add Zabe to #mediawiki_security - https://phabricator.wikimedia.org/T313026 (10Joe) 05Open→03Resolved p:05Triage→03Medium a:03Joe [13:53:16] (03PS4) 10David Caro: Change formatting of a few openstack calls [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/810107 (owner: 10Andrew Bogott) [13:53:18] (03PS11) 10David Caro: wmcs: vps: create_instance_with_prefix: unbreak [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/802170 (owner: 10Majavah) [14:04:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T312984)', diff saved to https://phabricator.wikimedia.org/P31143 and previous config saved to /var/cache/conftool/dbconfig/20220715-140431-ladsgroup.json [14:04:33] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:04:36] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [14:04:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1179.eqiad.wmnet with reason: Maintenance [14:04:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31144 and previous config saved to /var/cache/conftool/dbconfig/20220715-140451-ladsgroup.json [14:15:06] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:15:15] (03PS1) 10Daniel Kinzler: Make $wgAccountCreationThrottle must be an array. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 [14:21:51] (03PS2) 10Daniel Kinzler: Make $wgAccountCreationThrottle an array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 [14:26:36] (03CR) 10RhinosF1: [C: 03+1] Make $wgAccountCreationThrottle an array. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/814176 (owner: 10Daniel Kinzler) [14:32:34] 10SRE-tools, 10Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (10MoritzMuehlenhoff) @Marostegui Did this happen again for any reimage after I merged by patch above? [14:33:45] 10SRE-tools, 10Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (10Marostegui) Nope, it all went fine! Good to close. I need to decom a lot more in the upcoming days, will reopen if needed. Thanks for fixing it! [14:40:18] 10SRE-tools, 10Infrastructure-Foundations: Decommissioning two hosts end up with: Failed to wipe swraid - https://phabricator.wikimedia.org/T311593 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff Ack, closing then :-) [14:47:29] (03CR) 10RhinosF1: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [14:47:54] (03CR) 10CI reject: [V: 04-1] beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [14:49:04] (03PS3) 10RhinosF1: beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) [14:49:42] (03CR) 10CI reject: [V: 04-1] beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [14:52:00] (03PS4) 10RhinosF1: beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) [14:55:56] (03CR) 10Samtar: [C: 03+1] "Echoing my (limited understanding) comment on IRC that `--skip-config-validation` feels a bit 😐 that being said, it'd only affect beta if " [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [14:57:08] anyone SRE wise wish to merge ^ [15:04:12] (03CR) 10Bking: [V: 03+2] beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [15:05:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31146 and previous config saved to /var/cache/conftool/dbconfig/20220715-150505-ladsgroup.json [15:05:11] (03CR) 10Bking: [V: 03+2 C: 03+2] beta: [fix ci - until next week] add --skip-config-validation to update.php [puppet] - 10https://gerrit.wikimedia.org/r/814134 
(https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [15:05:12] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:13:59] (03PS1) 10Cmjohnson: updating site.pp for cloudweb servers, setup incorrectly for private vlan [puppet] - 10https://gerrit.wikimedia.org/r/814185 (https://phabricator.wikimedia.org/T305414) [15:14:34] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:28] RECOVERY - SSH on db1110.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:16:41] (03CR) 10Cmjohnson: [C: 03+2] updating site.pp for cloudweb servers, setup incorrectly for private vlan [puppet] - 10https://gerrit.wikimedia.org/r/814185 (https://phabricator.wikimedia.org/T305414) (owner: 10Cmjohnson) [15:19:56] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] beta: fix multiline string being treated as 2 commands. [puppet] - 10https://gerrit.wikimedia.org/r/814135 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [15:20:04] (03PS4) 10Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [15:20:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31147 and previous config saved to /var/cache/conftool/dbconfig/20220715-152010-ladsgroup.json [15:23:04] (03CR) 10Bking: [C: 03+2] beta: fix multiline string being treated as 2 commands. 
[puppet] - 10https://gerrit.wikimedia.org/r/814135 (https://phabricator.wikimedia.org/T313128) (owner: 10RhinosF1) [15:28:42] (03CR) 10Ladsgroup: "Seems to be working fine: https://puppet-compiler.wmflabs.org/pcc-worker1002/36278/db2144.codfw.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [15:31:10] (03CR) 10Joal: "Small nits,and I think you have been missing file: puppet/modules/profile/manifests/analytics/refinery/job/test/data_purge.pp" [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [15:35:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P31148 and previous config saved to /var/cache/conftool/dbconfig/20220715-153515-ladsgroup.json [15:48:32] RECOVERY - Check systemd state on contint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T312984)', diff saved to https://phabricator.wikimedia.org/P31149 and previous config saved to /var/cache/conftool/dbconfig/20220715-155021-ladsgroup.json [15:50:24] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:50:25] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [15:50:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2105.codfw.wmnet with reason: Maintenance [15:50:39] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 6 hosts with reason: Maintenance [15:50:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 6 hosts with reason: Maintenance [15:51:02] joal: thanks for the 
review :] would you be available for a quick chat about that? if not, that's ok! since silent-friday. We can do async [16:16:02] RECOVERY - SSH on wtp1044.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:16:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:16:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [16:20:12] (03CR) 10Marostegui: [C: 04-1] "parsercache hosts should not have that enabled, both parameters should be set to 0 there. They are currently 0 and should remain like that" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:30:55] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [16:33:04] 10SRE, 10DSE-Kubernetes-Cluster, 10Infrastructure-Foundations, 10vm-requests: Site: eqiad : 3 VMs requested for Etcd cluster in support of the new DSE Kubernetes cluster - https://phabricator.wikimedia.org/T311131 (10BTullis) [16:37:08] (03CR) 10Cwhite: [C: 03+2] loki-beta: increase grpc message size [puppet] - 10https://gerrit.wikimedia.org/r/813985 (https://phabricator.wikimedia.org/T222826) (owner: 10Cwhite) [16:40:10] (03CR) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [16:40:20] PROBLEM - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, 
WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:42:52] ACKNOWLEDGEMENT - MegaRAID on an-worker1093 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Investigating - T313130 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:57:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:57:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:57:26] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on 8 hosts with reason: Maintenance [16:57:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on 8 hosts with reason: Maintenance [16:59:39] (03PS5) 10Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [16:59:49] (03CR) 10Ladsgroup: core.pp: Make sync_binlog and trx_commit configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [17:00:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:00:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [17:02:00] (03CR) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [17:03:35] 10SRE-swift-storage: Spike in Swift errors - 
https://phabricator.wikimedia.org/T313102 (10MatthewVernon) @tstarling thanks for fixing. What is surprising to me at least is that the grafana swift dashboards don't reflect this - you can see a brief spike in read errors around the failure on 12th July, but then ba... [17:05:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:05:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1156.eqiad.wmnet with reason: Maintenance [17:05:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:05:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 20:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:05:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T312984)', diff saved to https://phabricator.wikimedia.org/P31150 and previous config saved to /var/cache/conftool/dbconfig/20220715-170545-ladsgroup.json [17:05:50] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:05:56] PROBLEM - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is CRITICAL: CRITICAL - failed 92 probes of 678 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:06:14] (03CR) 10Ori: [C: 03+2] New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [17:10:48] (03Merged) 10jenkins-bot: New service: function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/793862 (https://phabricator.wikimedia.org/T295698) (owner: 10Ori) [17:12:46] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312984)', diff saved to https://phabricator.wikimedia.org/P31151 and previous config saved to /var/cache/conftool/dbconfig/20220715-171246-ladsgroup.json [17:12:50] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:14:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:18:58] RECOVERY - IPv6 ping to ulsfo on ripe-atlas-ulsfo IPv6 is OK: OK - failed 57 probes of 677 (alerts on 90) - https://atlas.ripe.net/measurements/1791309/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:19:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: (2) Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [17:20:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1003.wikimedia.org with OS bullseye [17:20:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye [17:27:06] (03PS1) 10Majavah: dynamicproxy: urlproxy: add a simple rate limit [puppet] - 
10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) [17:27:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31152 and previous config saved to /var/cache/conftool/dbconfig/20220715-172751-ladsgroup.json [17:27:54] (03CR) 10CI reject: [V: 04-1] dynamicproxy: urlproxy: add a simple rate limit [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [17:28:24] (03PS2) 10Majavah: dynamicproxy: urlproxy: add a simple rate limit [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) [17:31:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [17:31:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host cloudweb1004.wikimedia.org with OS bullseye [17:31:44] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye [17:35:15] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1003.wikimedia.org with reason: host reimage [17:35:57] (03PS3) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) [17:36:18] RECOVERY - MegaRAID on an-worker1093 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:42:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P31154 and previous config saved to 
/var/cache/conftool/dbconfig/20220715-174256-ladsgroup.json [17:43:15] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [17:46:53] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudweb1004.wikimedia.org with reason: host reimage [17:48:04] (03CR) 10Joal: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [17:48:34] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1003.wikimedia.org with OS bullseye [17:48:40] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1003.wikimedia.org with OS bullseye co... 
[17:55:29] (03CR) 10BryanDavis: dynamicproxy: urlproxy: add a simple rate limit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/814193 (https://phabricator.wikimedia.org/T313131) (owner: 10Majavah) [17:58:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T312984)', diff saved to https://phabricator.wikimedia.org/P31155 and previous config saved to /var/cache/conftool/dbconfig/20220715-175801-ladsgroup.json [17:58:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:58:07] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [17:58:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1146.eqiad.wmnet with reason: Maintenance [17:58:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31156 and previous config saved to /var/cache/conftool/dbconfig/20220715-175822-ladsgroup.json [18:01:09] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudweb1004.wikimedia.org with OS bullseye [18:01:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host cloudweb1004.wikimedia.org with OS bullseye co... 
[18:05:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31157 and previous config saved to /var/cache/conftool/dbconfig/20220715-180532-ladsgroup.json [18:05:37] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:20:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31158 and previous config saved to /var/cache/conftool/dbconfig/20220715-182037-ladsgroup.json [18:30:17] (03CR) 10Marostegui: "thanks, on Monday I'll pick a host from every section (master and slave) and run PPC to make sure we are not changing it where we shouldn'" [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [18:30:32] !log T300943 Re-imaging `elastic20[61-72]` from buster -> bullseye, one host at a time. These hosts are not in service currently so re-imaging is safe. 
[18:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:37] T300943: Service implementation for elastic20[61-86].codfw.wmnet - https://phabricator.wikimedia.org/T300943 [18:31:04] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2061.codfw.wmnet with OS bullseye [18:35:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P31159 and previous config saved to /var/cache/conftool/dbconfig/20220715-183542-ladsgroup.json [18:44:53] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2061.codfw.wmnet with reason: host reimage [18:47:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2061.codfw.wmnet with reason: host reimage [18:50:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31160 and previous config saved to /var/cache/conftool/dbconfig/20220715-185047-ladsgroup.json [18:50:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:50:51] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:51:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1105.eqiad.wmnet with reason: Maintenance [18:51:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1105:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31161 and previous config saved to /var/cache/conftool/dbconfig/20220715-185107-ladsgroup.json [18:56:56] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) [18:57:18] 10SRE, 
10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q4:(Need By: TBD) rack/setup/install cloudweb100[34] - https://phabricator.wikimedia.org/T305414 (10Cmjohnson) 05Open→03Resolved [18:58:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31162 and previous config saved to /var/cache/conftool/dbconfig/20220715-185842-ladsgroup.json [18:58:48] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [18:59:14] (03PS4) 10Mforns: analytics:refinery:job:data_purge: Add --allowed-interval to deletion jobs [puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) [19:01:19] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2061.codfw.wmnet with OS bullseye [19:01:38] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2062.codfw.wmnet with OS bullseye [19:07:48] (03CR) 10Mforns: [V: 04-1] "I think this is ready." 
[puppet] - 10https://gerrit.wikimedia.org/r/813921 (https://phabricator.wikimedia.org/T270433) (owner: 10Mforns) [19:13:22] (03CR) 10CDanis: Don't hardcode v1 of the api in the base path (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 (owner: 10CDanis) [19:13:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31163 and previous config saved to /var/cache/conftool/dbconfig/20220715-191347-ladsgroup.json [19:15:28] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2062.codfw.wmnet with reason: host reimage [19:18:02] PROBLEM - SSH on db1109.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:18:27] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2062.codfw.wmnet with reason: host reimage [19:26:15] 10SRE, 10ops-codfw, 10Discovery-Search, 10Elasticsearch, 10Patch-For-Review: Degraded RAID on elastic2049 - https://phabricator.wikimedia.org/T311939 (10Papaul) I looked into this yesterday and today, it looks like we are having some HW issues on this server and unfortunately the server is out of warran... [19:26:41] 10SRE, 10MediaWiki-General, 10Traffic-Icebox, 10Patch-For-Review: Investigate query parameter normalization for MW/services - https://phabricator.wikimedia.org/T138093 (10ori) OK, current status: * libvmod-querysort is [[ https://gerrit.wikimedia.org/g/operations/software/varnish/libvmod-querysort | in Ge... 
[19:28:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312', diff saved to https://phabricator.wikimedia.org/P31164 and previous config saved to /var/cache/conftool/dbconfig/20220715-192852-ladsgroup.json [19:31:10] (03PS2) 10CDanis: Don't hardcode v1 of the api in the base path [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 [19:31:12] (03PS2) 10CDanis: Add support for fetching current oncallers [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 [19:31:15] (03PS2) 10CDanis: display current oncallers in Klaxon UI [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 [19:31:51] (03CR) 10CDanis: Add support for fetching current oncallers (033 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 (owner: 10CDanis) [19:32:00] (03CR) 10CDanis: display current oncallers in Klaxon UI (032 comments) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 (owner: 10CDanis) [19:32:03] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2062.codfw.wmnet with OS bullseye [19:33:42] (03CR) 10CDanis: [C: 03+2] Don't hardcode v1 of the api in the base path [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 (owner: 10CDanis) [19:33:54] (03CR) 10CDanis: [C: 03+2] Add support for fetching current oncallers [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 (owner: 10CDanis) [19:37:39] (03Merged) 10jenkins-bot: Don't hardcode v1 of the api in the base path [software/klaxon] - 10https://gerrit.wikimedia.org/r/813940 (owner: 10CDanis) [19:37:41] (03Merged) 10jenkins-bot: Add support for fetching current oncallers [software/klaxon] - 10https://gerrit.wikimedia.org/r/813941 (owner: 10CDanis) [19:42:55] (03PS1) 10CDanis: Revert "Use ProxyFix middleware to correctly recognize HTTPS usage" [software/klaxon] - 10https://gerrit.wikimedia.org/r/814139 [19:43:28] (03CR) 10CDanis: [C: 03+2] Revert "Use ProxyFix middleware to correctly recognize 
HTTPS usage" [software/klaxon] - 10https://gerrit.wikimedia.org/r/814139 (owner: 10CDanis) [19:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1105:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31165 and previous config saved to /var/cache/conftool/dbconfig/20220715-194358-ladsgroup.json [19:43:59] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:44:03] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [19:44:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1122.eqiad.wmnet with reason: Maintenance [19:44:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1122 (T312984)', diff saved to https://phabricator.wikimedia.org/P31166 and previous config saved to /var/cache/conftool/dbconfig/20220715-194418-ladsgroup.json [19:45:17] (03Merged) 10jenkins-bot: Revert "Use ProxyFix middleware to correctly recognize HTTPS usage" [software/klaxon] - 10https://gerrit.wikimedia.org/r/814139 (owner: 10CDanis) [19:47:23] cdanis: bahhhh [19:47:26] 10SRE, 10Sustainability (Incident Followup): Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (10CDanis) Unfortunately I had to revert the above patch because the necessary middlewear library isn't included in Debian's `python3-werkzeug` until Bullseye. 
[19:47:32] legoktm: i KNOW [19:47:41] :( [19:50:04] cdanis: https://sources.debian.org/src/python-werkzeug/0.14.1%2Bdfsg1-4%2Bdeb10u1/werkzeug/contrib/fixers.py/#L97 [19:50:20] ahahaha [19:50:22] okay, thanks [19:50:26] I was just looking at file names [19:50:27] so try / except ImportException with the legacy name [19:50:29] I'll write a new one with -- yeah, that [19:50:36] <3 [19:51:19] (03PS3) 10CDanis: display current oncallers in Klaxon UI [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 [19:51:48] I remembered ProxyFix being a pretty old thing, so I codesearched in Debian and found https://sources.debian.org/src/flask-login/0.5.0-2/test_login.py/?hl=24#L24 [19:53:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312984)', diff saved to https://phabricator.wikimedia.org/P31167 and previous config saved to /var/cache/conftool/dbconfig/20220715-195334-ladsgroup.json [19:53:39] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:08:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31168 and previous config saved to /var/cache/conftool/dbconfig/20220715-200839-ladsgroup.json [20:10:24] (03CR) 10CDanis: [C: 03+2] display current oncallers in Klaxon UI (031 comment) [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 (owner: 10CDanis) [20:12:41] (03Merged) 10jenkins-bot: display current oncallers in Klaxon UI [software/klaxon] - 10https://gerrit.wikimedia.org/r/813942 (owner: 10CDanis) [20:14:52] (03PS1) 10CDanis: Use ProxyFix middleware to recognize HTTPS usage, attempt #2 [software/klaxon] - 10https://gerrit.wikimedia.org/r/814251 (https://phabricator.wikimedia.org/T308941) [20:16:52] (03CR) 10Legoktm: [C: 03+1] "LGTM!" 
[software/klaxon] - 10https://gerrit.wikimedia.org/r/814251 (https://phabricator.wikimedia.org/T308941) (owner: 10CDanis) [20:17:51] (03CR) 10CDanis: [C: 03+2] Use ProxyFix middleware to recognize HTTPS usage, attempt #2 [software/klaxon] - 10https://gerrit.wikimedia.org/r/814251 (https://phabricator.wikimedia.org/T308941) (owner: 10CDanis) [20:19:09] :shipit: [20:19:26] RECOVERY - SSH on db1109.mgmt is OK: SSH OK - OpenSSH_7.4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:20:27] (03Merged) 10jenkins-bot: Use ProxyFix middleware to recognize HTTPS usage, attempt #2 [software/klaxon] - 10https://gerrit.wikimedia.org/r/814251 (https://phabricator.wikimedia.org/T308941) (owner: 10CDanis) [20:23:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122', diff saved to https://phabricator.wikimedia.org/P31169 and previous config saved to /var/cache/conftool/dbconfig/20220715-202344-ladsgroup.json [20:28:30] 10SRE, 10Patch-For-Review, 10Sustainability (Incident Followup): Klaxon redirects to http://klaxon.wikimedia.org (not https) - https://phabricator.wikimedia.org/T308941 (10CDanis) 05Open→03Resolved Thanks to @legoktm for catching that this was packaged pre-Bullseye under a different name. 
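Note on the ProxyFix fix discussed above: the "try / except with the legacy name" idea can be sketched roughly as below. This is only an illustration, not the code merged in change 814251. The helper names import_first and load_proxy_fix are hypothetical. The two module paths are the ones Werkzeug has used for ProxyFix (werkzeug.contrib.fixers in the 0.14 series linked above, werkzeug.middleware.proxy_fix in newer releases). The exception Python raises for a missing module is ImportError.

```python
import importlib


def import_first(candidates):
    """Return the first attribute that imports cleanly.

    `candidates` is a list of (module_name, attribute_name) pairs, tried in order.
    Hypothetical helper, written only to illustrate the fallback-import pattern.
    """
    last_error = None
    for module_name, attr_name in candidates:
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr_name)
        except (ImportError, AttributeError) as exc:
            # Module or attribute is missing under this name; try the next one.
            last_error = exc
    raise ImportError(f"none of {candidates!r} could be imported") from last_error


def load_proxy_fix():
    """Look up ProxyFix under its newer location first, then the legacy one."""
    return import_first([
        ("werkzeug.middleware.proxy_fix", "ProxyFix"),  # newer Werkzeug releases
        ("werkzeug.contrib.fixers", "ProxyFix"),        # legacy location (0.14 series)
    ])
```

Assuming a Flask-style application object named app, usage would be along the lines of app.wsgi_app = load_proxy_fix()(app.wsgi_app). The equivalent inline form is a plain try: from werkzeug.middleware.proxy_fix import ProxyFix / except ImportError: from werkzeug.contrib.fixers import ProxyFix.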
[20:29:06] :D [20:38:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1122 (T312984)', diff saved to https://phabricator.wikimedia.org/P31170 and previous config saved to /var/cache/conftool/dbconfig/20220715-203849-ladsgroup.json [20:38:51] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:38:54] !log ryankemper@cumin1001 START - Cookbook sre.hosts.reimage for host elastic2063.codfw.wmnet with OS bullseye [20:38:54] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:39:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1170.eqiad.wmnet with reason: Maintenance [20:39:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31171 and previous config saved to /var/cache/conftool/dbconfig/20220715-203909-ladsgroup.json [20:46:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31172 and previous config saved to /var/cache/conftool/dbconfig/20220715-204617-ladsgroup.json [20:46:22] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [20:52:43] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2063.codfw.wmnet with reason: host reimage [20:55:10] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2063.codfw.wmnet with reason: host reimage [20:57:42] (03PS1) 10CDanis: Show oncallers in both locations of The Button [software/klaxon] - 10https://gerrit.wikimedia.org/r/814255 [21:00:15] (03CR) 10CDanis: [C: 03+2] Show oncallers in both locations of The Button [software/klaxon] - 
10https://gerrit.wikimedia.org/r/814255 (owner: 10CDanis) [21:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31173 and previous config saved to /var/cache/conftool/dbconfig/20220715-210122-ladsgroup.json [21:01:48] (03Merged) 10jenkins-bot: Show oncallers in both locations of The Button [software/klaxon] - 10https://gerrit.wikimedia.org/r/814255 (owner: 10CDanis) [21:04:35] (03CR) 10Krinkle: core.pp: Make sync_binlog and trx_commit configurable (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/813917 (owner: 10Marostegui) [21:08:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2063.codfw.wmnet with OS bullseye [21:16:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P31174 and previous config saved to /var/cache/conftool/dbconfig/20220715-211628-ladsgroup.json [21:31:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T312984)', diff saved to https://phabricator.wikimedia.org/P31175 and previous config saved to /var/cache/conftool/dbconfig/20220715-213133-ladsgroup.json [21:31:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:31:37] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:31:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1182.eqiad.wmnet with reason: Maintenance [21:31:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T312984)', diff saved to https://phabricator.wikimedia.org/P31176 and previous config saved to /var/cache/conftool/dbconfig/20220715-213153-ladsgroup.json [21:33:12] (03CR) 10Krinkle: [C: 04-1] Move CirrusSearch 
settings from IS.php to ext-CirrusSearch.php (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/799272 (https://phabricator.wikimedia.org/T308932) (owner: 10Ladsgroup) [21:38:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312984)', diff saved to https://phabricator.wikimedia.org/P31177 and previous config saved to /var/cache/conftool/dbconfig/20220715-213852-ladsgroup.json [21:38:56] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T312984 [21:41:40] (03CR) 10Dzahn: "Could we set the severity to critical-but-not-paging first and then upgrade it to that after confirming everything?" [puppet] - 10https://gerrit.wikimedia.org/r/812846 (https://phabricator.wikimedia.org/T305847) (owner: 10Filippo Giunchedi) [21:50:50] PROBLEM - SSH on mw1321.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:53:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31178 and previous config saved to /var/cache/conftool/dbconfig/20220715-215357-ladsgroup.json [22:09:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P31179 and previous config saved to /var/cache/conftool/dbconfig/20220715-220902-ladsgroup.json [22:24:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T312984)', diff saved to https://phabricator.wikimedia.org/P31180 and previous config saved to /var/cache/conftool/dbconfig/20220715-222407-ladsgroup.json [22:24:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [22:24:11] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf 
wikis - https://phabricator.wikimedia.org/T312984 [22:24:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1129.eqiad.wmnet with reason: Maintenance [22:24:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T312984)', diff saved to https://phabricator.wikimedia.org/P31181 and previous config saved to /var/cache/conftool/dbconfig/20220715-222427-ladsgroup.json [22:28:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312984)', diff saved to https://phabricator.wikimedia.org/P31182 and previous config saved to /var/cache/conftool/dbconfig/20220715-222845-ladsgroup.json [22:42:54] PROBLEM - SSH on wtp1044.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:43:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31183 and previous config saved to /var/cache/conftool/dbconfig/20220715-224350-ladsgroup.json [22:52:12] RECOVERY - SSH on mw1321.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:58:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P31184 and previous config saved to /var/cache/conftool/dbconfig/20220715-225855-ladsgroup.json [23:14:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T312984)', diff saved to https://phabricator.wikimedia.org/P31185 and previous config saved to /var/cache/conftool/dbconfig/20220715-231400-ladsgroup.json [23:14:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:14:06] T312984: Adjust the field type of flaggedpages.fp_pending_since to fixed binary on wmf 
wikis - https://phabricator.wikimedia.org/T312984 [23:14:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1139.eqiad.wmnet with reason: Maintenance [23:20:11] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:20:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db1102.eqiad.wmnet with reason: Maintenance [23:24:52] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:26:16] PROBLEM - SSH on db1110.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [23:57:36] (03PS3) 10Cwhite: hiera: deploy and enable loki on grafana hosts [puppet] - 10https://gerrit.wikimedia.org/r/813724 (https://phabricator.wikimedia.org/T222826) [23:59:33] (03CR) 10Cwhite: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/799001 (https://phabricator.wikimedia.org/T305175) (owner: 10Cwhite)