[00:01:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [00:01:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [00:04:03] PROBLEM - MariaDB Replica SQL: s6 on db1155 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1034, Errmsg: Could not execute Write_rows_v1 event on table ruwiki.pagelinks: Index for table pagelinks is corrupt: try to repair it, Error_code: 1034: handler error HA_ERR_CRASHED: the events master log db1165-bin.003564, end_log_pos 548165287 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:06:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[00:06:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [00:10:18] SRE, SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9828262 (Dzahn) In progress→Stalled [00:10:23] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9828263 (Dzahn) In progress→Stalled [00:10:31] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9828264 (Dzahn) In progress→Stalled [00:10:54] SRE, SRE-Access-Requests, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9828265 (Dzahn) In progress→Stalled [00:11:49] PROBLEM - MariaDB Replica Lag: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 636.99 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:11:49] PROBLEM - MariaDB Replica Lag: s6 on db1155 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 637.02 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:12:05] PROBLEM - MariaDB Replica Lag: s6 on clouddb1015 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 651.92 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [00:12:07] PROBLEM - MariaDB Replica Lag: s6 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.05 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica 
[00:12:27] SRE, LDAP-Access-Requests: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138#9828268 (Dzahn) In progress→Stalled stalled waiting for user input - handing over to next week's clinic duty [00:13:17] SRE, LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9828266 (Dzahn) In progress→Stalled stalled waiting for approval - handing over to next week's clinic duty [00:13:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P63046 and previous config saved to /var/cache/conftool/dbconfig/20240524-001326-marostegui.json [00:13:56] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9828271 (Dzahn) stalled waiting for approval - handing over to next week's clinic duty [00:14:05] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9828272 (Dzahn) stalled waiting for approval - handing over to next week's clinic duty [00:14:17] SRE, SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9828273 (Dzahn) stalled waiting for approval - handing over to next week's clinic duty [00:15:01] SRE, SRE-Access-Requests, Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9828275 (Dzahn) stalled waiting for NDA - handing over to next week's clinic duty [00:16:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[00:16:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [00:17:25] SRE, LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9828277 (Dzahn) Welcome to WMF! We can handle the access to the wmf group here fairly quickly. analytics-privatedata-users isn't an LDAP group though. That would be a... [00:21:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [00:21:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [00:22:53] (PS6) Scott French: services: add data-gateway service [deployment-charts] - https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [00:28:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217', diff saved to https://phabricator.wikimedia.org/P63047 and previous config saved to /var/cache/conftool/dbconfig/20240524-002834-marostegui.json [00:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:37:40] (CR) Eevans: [C:+1] services: add data-gateway service [deployment-charts] - https://gerrit.wikimedia.org/r/1032595 
(https://phabricator.wikimedia.org/T364921) (owner: Scott French) [00:39:42] (CR) Scott French: Migrate AQS2 services and image-suggestions to calico network policies (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: Btullis) [00:43:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2217 (T364299)', diff saved to https://phabricator.wikimedia.org/P63048 and previous config saved to /var/cache/conftool/dbconfig/20240524-004342-marostegui.json [00:43:48] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [00:46:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [00:46:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [00:49:12] SRE, LDAP-Access-Requests: Grant Access to wmf/analytics-privatedata-users for derenrich - https://phabricator.wikimedia.org/T365381#9828349 (derenrich) approval from who? [00:51:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [00:51:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:10:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[01:10:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:25:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:35:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:43:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [01:43:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:53:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[01:53:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [01:56:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [01:56:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:01:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:04:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:09:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:16:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[02:16:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:21:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [02:21:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:34:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:58:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:58:49] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:33:37] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [03:35:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [03:54:48] (CR) Pppery: "The Meta 
interwiki map section is in the same order as the Metawiki interwiki map, which is not quite alpha-sorted but if that is fixed it" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035417 (owner: Reedy) [04:23:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [04:23:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [04:23:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:23:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [04:23:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T364299)', diff saved to https://phabricator.wikimedia.org/P63049 and previous config saved to /var/cache/conftool/dbconfig/20240524-042358-marostegui.json [04:24:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [04:29:24] (PS1) Marostegui: db2187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035604 [04:29:42] (PS1) Marostegui: Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1035619 [04:29:48] (CR) Marostegui: [C:+2] db2187: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035604 (owner: Marostegui) [04:30:11] (CR) Marostegui: [C:+2] Revert "db1238: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1035619 (owner: Marostegui) [04:34:34] (PS1) Marostegui: db2122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035605 (https://phabricator.wikimedia.org/T362745) [04:34:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2122', diff saved to 
https://phabricator.wikimedia.org/P63050 and previous config saved to /var/cache/conftool/dbconfig/20240524-043441-root.json [04:35:11] (CR) Marostegui: [C:+2] db2122: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035605 (https://phabricator.wikimedia.org/T362745) (owner: Marostegui) [04:36:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2122.codfw.wmnet with OS bookworm [04:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:43:37] PROBLEM - Backup freshness on backup1001 is CRITICAL: All failures: 1 (lists2001), Fresh: 142 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [04:54:07] (PS1) Gerrit maintenance bot: mariadb: Promote db2214 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1034939 (https://phabricator.wikimedia.org/T365783) [04:56:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2122.codfw.wmnet with reason: host reimage [04:58:01] (PS1) Marostegui: Revert "db2122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1035620 [04:59:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2122.codfw.wmnet with reason: host reimage [05:00:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [05:10:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364299)', diff saved to https://phabricator.wikimedia.org/P63051 and previous config saved 
to /var/cache/conftool/dbconfig/20240524-051028-marostegui.json [05:10:35] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:17:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P63052 and previous config saved to /var/cache/conftool/dbconfig/20240524-051744-root.json [05:18:15] (CR) Marostegui: [C:+2] Revert "db2122: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1035620 (owner: Marostegui) [05:20:21] (PS1) Marostegui: db2122: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035627 [05:21:01] (CR) Marostegui: [C:+2] db2122: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035627 (owner: Marostegui) [05:23:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2122.codfw.wmnet with OS bookworm [05:25:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P63053 and previous config saved to /var/cache/conftool/dbconfig/20240524-052537-marostegui.json [05:25:39] (PS1) Marostegui: db2137: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035628 [05:26:14] (CR) Marostegui: [C:+2] db2137: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1035628 (owner: Marostegui) [05:32:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P63054 and previous config saved to /var/cache/conftool/dbconfig/20240524-053250-root.json [05:40:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P63055 and previous config saved to /var/cache/conftool/dbconfig/20240524-054045-marostegui.json [05:43:36] RECOVERY - Backup freshness on 
backup1001 is OK: Fresh: 143 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [05:48:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P63056 and previous config saved to /var/cache/conftool/dbconfig/20240524-054759-root.json [05:55:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364299)', diff saved to https://phabricator.wikimedia.org/P63057 and previous config saved to /var/cache/conftool/dbconfig/20240524-055553-marostegui.json [05:55:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [05:55:59] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [05:56:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [05:56:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63058 and previous config saved to /var/cache/conftool/dbconfig/20240524-055616-marostegui.json [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T0600) [06:03:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P63059 and previous config saved to /var/cache/conftool/dbconfig/20240524-060305-root.json [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for 
poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:18:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2122 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P63060 and previous config saved to /var/cache/conftool/dbconfig/20240524-061812-root.json [06:24:26] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:25:21] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:28:23] (PS1) Muehlenhoff: Remove SSH key for Kwaku [puppet] - https://gerrit.wikimedia.org/r/1035630 [06:32:51] (CR) Muehlenhoff: [C:+2] Remove SSH key for Kwaku [puppet] - https://gerrit.wikimedia.org/r/1035630 (owner: Muehlenhoff) [06:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63061 and previous config saved to /var/cache/conftool/dbconfig/20240524-064053-marostegui.json [06:40:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:53:24] (PS4) Fabfur: benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) [06:55:42] (PS1) Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - https://gerrit.wikimedia.org/r/1035631 
(https://phabricator.wikimedia.org/T357750) [06:55:46] (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718) (owner: Fabfur) [06:56:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P63062 and previous config saved to /var/cache/conftool/dbconfig/20240524-065600-marostegui.json [06:58:31] (CR) CI reject: [V:-1] tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff) [07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T0700) [07:11:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P63063 and previous config saved to /var/cache/conftool/dbconfig/20240524-071108-marostegui.json [07:13:15] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:13:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:15:07] (PS2) Ayounsi: homer: comments-only change: specify 198.35.27.0/24 as ns2 [homer/public] - https://gerrit.wikimedia.org/r/1032522 (owner: Ssingh) [07:16:33] (CR) Ayounsi: [C:+2] homer: comments-only change: specify 198.35.27.0/24 as ns2 [homer/public] - https://gerrit.wikimedia.org/r/1032522 (owner: Ssingh) [07:17:05] (Merged) jenkins-bot: homer: comments-only change: specify 198.35.27.0/24 as ns2 [homer/public] - https://gerrit.wikimedia.org/r/1032522 (owner: Ssingh) [07:19:26] RESOLVED: SystemdUnitFailed: 
httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:24:19] PROBLEM - BGP status on cr2-eqdfw is CRITICAL: BGP CRITICAL - AS2914/IPv6: Active - NTT, AS2914/IPv4: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:25:21] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:26:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364299)', diff saved to https://phabricator.wikimedia.org/P63064 and previous config saved to /var/cache/conftool/dbconfig/20240524-072616-marostegui.json [07:26:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:26:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:26:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:26:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63065 and previous config saved to /var/cache/conftool/dbconfig/20240524-072639-marostegui.json [07:31:15] (PS1) Effie Mouzeli: memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [07:31:28] (CR) CI reject: [V:-1] memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli) [07:31:36] (PS2) Effie Mouzeli: memcached: add extstore option [puppet] - 
https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [07:32:03] SRE, collaboration-services, Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9828629 (Jelto) [07:32:54] SRE, collaboration-services, Wikimedia-Mailing-lists: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9828631 (Jelto) [07:32:58] SRE, Wikimedia-Mailing-lists, Datacenter-Switchover: Make mailman3 work in the standby host (lists2001.wikimedia.org) - https://phabricator.wikimedia.org/T283615#9828632 (Jelto) [07:33:11] (PS1) DCausse: cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1035634 [07:33:33] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:33:35] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:33:37] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [07:34:31] (CR) CI reject: [V:-1] memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli) [07:34:59] (CR) CI reject: [V:-1] cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1035634 (owner: DCausse) [07:35:08] (PS2) DCausse: cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1035634 [07:36:48] (CR) Gehel: [C:+2] cirrus: relax CirrusConsumerRerenderFetchErrorRate [alerts] - https://gerrit.wikimedia.org/r/1035634 (owner: DCausse) [07:36:57] (PS3) Effie Mouzeli: memcached: add extstore option 
[puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [07:37:31] (PS4) Effie Mouzeli: memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [07:38:35] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:40:20] !log dcausse@deploy1002 Started deploy [airflow-dags/search@8f0b4a1]: search: fix import_ttl dag [07:40:40] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@8f0b4a1]: search: fix import_ttl dag (duration: 00m 19s) [07:51:37] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 71, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:52:37] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:52:39] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [07:57:33] (PS5) Effie Mouzeli: memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [08:03:14] (PS6) Effie Mouzeli: memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [08:07:57] (PS7) Effie Mouzeli: (WIP) memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) [08:08:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63066 and previous config saved to 
/var/cache/conftool/dbconfig/20240524-080835-marostegui.json
[08:08:41] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:10:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[08:10:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[08:17:39] RECOVERY - BGP status on cr2-eqdfw is OK: BGP OK - up: 201, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:20:14] (PS1) Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689)
[08:23:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P63067 and previous config saved to /var/cache/conftool/dbconfig/20240524-082343-marostegui.json
[08:25:18] (PS5) Fabfur: benthos:cache: switch to rfc5424 format [puppet] - https://gerrit.wikimedia.org/r/1035440 (https://phabricator.wikimedia.org/T365718)
[08:25:58] (CR) Filippo Giunchedi: [C:+1] trafficserver: point pyrra to thanos discovery record [puppet] - https://gerrit.wikimedia.org/r/1035541 (https://phabricator.wikimedia.org/T356386) (owner: Herron)
[08:36:20] (PS2) Sergio Gimeno: [Beta] eswiki: disable personalized praise [mediawiki-config] - https://gerrit.wikimedia.org/r/1035473 (https://phabricator.wikimedia.org/T359038)
[08:36:20] (PS1) Sergio Gimeno: [Beta] cswiki: enable CommunityConfiguration for GrowthExperiments [mediawiki-config] - https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892)
[08:36:42] (CR) Sergio Gimeno: [C:-1] "Not yet" [mediawiki-config] - https://gerrit.wikimedia.org/r/1035726 (https://phabricator.wikimedia.org/T364892) (owner: Sergio Gimeno)
[08:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:38:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P63068 and previous config saved to /var/cache/conftool/dbconfig/20240524-083851-marostegui.json
[08:41:00] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[08:48:00] (CR) DCausse: "thanks for the review!" [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[08:48:08] (PS9) DCausse: wdqs: extract categories reload to its own cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1032544
[08:48:09] (PS18) DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069)
[08:54:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364299)', diff saved to https://phabricator.wikimedia.org/P63069 and previous config saved to /var/cache/conftool/dbconfig/20240524-085400-marostegui.json
[08:54:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[08:54:05] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[08:54:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[08:54:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63070 and previous config saved to /var/cache/conftool/dbconfig/20240524-085423-marostegui.json
[09:00:59] (CR) Stevemunene: [C:+2] idp-test: Change datahub staging url
[puppet] - https://gerrit.wikimedia.org/r/1035414 (https://phabricator.wikimedia.org/T365674) (owner: Stevemunene)
[09:04:47] (PS2) Muehlenhoff: tlsproxy::envoy: Remove support for legacy sslcert provider [puppet] - https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750)
[09:12:01] (CR) Hashar: [C:+2] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox)
[09:12:35] (Merged) jenkins-bot: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox)
[09:13:09] RECOVERY - MariaDB Replica SQL: s6 on db1155 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:14:35] (CR) Hashar: [C:+2] "There are some glitches I am not entirely happy such as the ongoing triggered job adding a duplicate "test" under a Running section, but t" [software/gerrit] (deploy/wmf/stable-3.8) - https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: Paladox)
[09:15:59] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[09:21:33] (CR) Klausman: [C:+1] api-gateway: add normalise_paths option, enable in api-gateway [deployment-charts] - https://gerrit.wikimedia.org/r/1035481 (https://phabricator.wikimedia.org/T365439) (owner: Hnowlan)
[09:25:11] !log hashar@deploy1002 Started deploy [gerrit/gerrit@159288a]: Allow users to recheck tests in checkers - T363918
[09:25:15] T363918: Gerrit recheck button - https://phabricator.wikimedia.org/T363918
[09:25:18] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@159288a]: Allow users to recheck tests in
checkers - T363918 (duration: 00m 07s)
[09:25:36] (CR) Klausman: [C:+1] amd-pytorch: refactor the common bits to DRY the Dockerfiles (1 comment) [docker-images/production-images] - https://gerrit.wikimedia.org/r/1035441 (owner: Elukey)
[09:25:49] (PS1) Stevemunene: trafficserver: add datahub redirects to ATS [puppet] - https://gerrit.wikimedia.org/r/1035731 (https://phabricator.wikimedia.org/T365668)
[09:26:31] (PS2) Stevemunene: trafficserver: add datahub-next redirects [puppet] - https://gerrit.wikimedia.org/r/1035268 (https://phabricator.wikimedia.org/T365668)
[09:37:10] (PS2) Volans: sre.hosts.reimage: add support for VLAN move [cookbooks] - https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152)
[09:37:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63071 and previous config saved to /var/cache/conftool/dbconfig/20240524-093751-marostegui.json
[09:37:57] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[09:38:29] PROBLEM - MariaDB Replica Lag: s1 on db2176 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 655.58 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:40:49] (PS1) DCausse: [WIP] changeprop: drop CirrusSearch changeprop settings [deployment-charts] - https://gerrit.wikimedia.org/r/1035732
[09:42:21] (CR) Muehlenhoff: [C:-1] "Blocked until all legacy mediawiki installations are gone." [puppet] - https://gerrit.wikimedia.org/r/1035631 (https://phabricator.wikimedia.org/T357750) (owner: Muehlenhoff)
[09:43:33] (CR) DCausse: [C:-1] "up for discussion, these jobs will still run for private wikis for a couple months but the load is probably very low."
[deployment-charts] - https://gerrit.wikimedia.org/r/1035732 (owner: DCausse)
[09:46:24] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9828837 (RickiJay-WMDE)
[09:47:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2176', diff saved to https://phabricator.wikimedia.org/P63072 and previous config saved to /var/cache/conftool/dbconfig/20240524-094703-arnaudb.json
[09:47:10] (PS11) Stevemunene: provision datahub-next service records [dns] - https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299)
[09:47:10] (PS1) Stevemunene: provision datahub service records [dns] - https://gerrit.wikimedia.org/r/1035734 (https://phabricator.wikimedia.org/T363299)
[09:48:02] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9828864 (RickiJay-WMDE) I believe just a), thank you very much
[09:49:52] (CR) Hnowlan: [C:-1] maps: Add option to use PKI (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[09:53:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P63073 and previous config saved to /var/cache/conftool/dbconfig/20240524-095259-marostegui.json
[09:54:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2176.codfw.wmnet with reason: Host has issues
[09:54:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2176.codfw.wmnet with reason: Host has issues
[09:58:07] (CR) Zabe: [C:+1] password: Document wmgPasswordSecretKey [mediawiki-config] - https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: Krinkle)
[09:59:30] (PS1) Muehlenhoff: Deprecate
system::role for Blazegraph services [puppet] - https://gerrit.wikimedia.org/r/1035737
[10:08:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P63074 and previous config saved to /var/cache/conftool/dbconfig/20240524-100807-marostegui.json
[10:08:18] (PS1) Muehlenhoff: Deprecate system::role for initial set of WMCS roles [puppet] - https://gerrit.wikimedia.org/r/1035739
[10:12:21] (PS3) Muehlenhoff: maps: Add option to use PKI [puppet] - https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778)
[10:12:38] (CR) Muehlenhoff: maps: Add option to use PKI (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[10:12:44] (CR) Aklapper: [C:+2] Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1035546 (https://phabricator.wikimedia.org/T363188) (owner: Pppery)
[10:12:51] (CR) Aklapper: [V:+2 C:+2] Fix mangled JSON, redo export [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1035546 (https://phabricator.wikimedia.org/T363188) (owner: Pppery)
[10:15:50] (CR) Muehlenhoff: [C:+1] "Thanks! Looks good to me. Let's go ahead and merge and I'll add support to the Ganeti class to use it on a per cluster basis (and then I'l" [puppet] - https://gerrit.wikimedia.org/r/1023486 (https://phabricator.wikimedia.org/T309724) (owner: JHathaway)
[10:16:35] (CR) Muehlenhoff: "Thanks! I've followed up on your task, if we can actually still handle this via PQL, all the better!"
[puppet] - https://gerrit.wikimedia.org/r/1021896 (https://phabricator.wikimedia.org/T309724) (owner: Muehlenhoff)
[10:18:57] (PS1) EoghanGaffney: lists: Remove 'hasstatus => false' from mailman service [puppet] - https://gerrit.wikimedia.org/r/1035740
[10:21:10] (PS8) Muehlenhoff: (WIP) memcached: add extstore option [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[10:21:10] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[10:23:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364299)', diff saved to https://phabricator.wikimedia.org/P63075 and previous config saved to /var/cache/conftool/dbconfig/20240524-102315-marostegui.json
[10:23:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:23:22] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[10:23:25] (CR) Muehlenhoff: [C:+1] (WIP) memcached: add extstore option (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1035633 (https://phabricator.wikimedia.org/T352885) (owner: Effie Mouzeli)
[10:23:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[10:23:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T364299)', diff saved to https://phabricator.wikimedia.org/P63076 and previous config saved to /var/cache/conftool/dbconfig/20240524-102340-marostegui.json
[10:24:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2116 to clone on db2176 T365793', diff saved to https://phabricator.wikimedia.org/P63077 and previous config saved to /var/cache/conftool/dbconfig/20240524-102424-arnaudb.json
[10:24:29] T365793: db2176 crash -
https://phabricator.wikimedia.org/T365793
[10:27:48] !log arnaudb@cumin1002 START - Cookbook sre.mysql.clone of db2116.codfw.wmnet onto db2176.codfw.wmnet
[10:35:29] PROBLEM - MegaRAID on db2150 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:35:30] ACKNOWLEDGEMENT - MegaRAID on db2150 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T365797 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[10:35:39] ops-codfw, SRE, DC-Ops: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797 (ops-monitoring-bot) NEW
[10:38:56] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798 (MoritzMuehlenhoff) NEW
[10:39:17] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#9829018 (MoritzMuehlenhoff)
[10:39:22] SRE, Infrastructure-Foundations, Puppet-Infrastructure, Patch-For-Review, Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#9829019 (MoritzMuehlenhoff)
[10:39:52] SRE, Infrastructure-Foundations, Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#9829020 (MoritzMuehlenhoff)
[10:39:58] SRE, SRE-tools, collaboration-services, Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9829021 (MoritzMuehlenhoff)
[10:45:48] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade.
[10:46:46] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9829024 (Marostegui)
[10:48:00] SRE, Acme-chief, Infrastructure-Foundations, Puppet-Infrastructure, Traffic: Revert back to fleet-wide acmechief config once all ACME consumers are on Puppet 7 - https://phabricator.wikimedia.org/T365799 (MoritzMuehlenhoff) NEW
[10:49:25] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1035739 (owner: Muehlenhoff)
[10:49:36] (PS1) Elukey: services: upgrade tegola in codfw to use the envoy proxy for Swift [deployment-charts] - https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324)
[10:49:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2150, hardware issues ', diff saved to https://phabricator.wikimedia.org/P63078 and previous config saved to /var/cache/conftool/dbconfig/20240524-104953-arnaudb.json
[10:50:53] (CR) Elukey: "Ready for prod! If you are ok I'd do it on Monday next week :)" [deployment-charts] - https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[10:51:55] RECOVERY - MariaDB Replica Lag: s6 on db1155 is OK: OK slave_sql_lag Replication lag: 0.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:55:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db2150.codfw.wmnet with reason: reimage
[10:56:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db2150.codfw.wmnet with reason: reimage
[10:56:19] ops-codfw, SRE, DBA, DC-Ops: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9829045 (ABran-WMF) p:Triage→Medium Identified SSD seem to have flapped on/off the RAID who ended up rebuilding; ` seqNum: 0x000006ad Time: Fri May 24 10:14:38 2024 Code: 0x0000010c Class: 1...
[10:56:43] (PS6) Btullis: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423)
[10:58:40] (CR) Btullis: Migrate AQS2 services and image-suggestions to calico network policies (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: Btullis)
[11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T0700)
[11:00:04] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240524T1100).
[11:00:51] (CR) Cathal Mooney: [C:+2] Set DHCP relay for EVPN switches in codfw to 'forward-only' mode [homer/public] - https://gerrit.wikimedia.org/r/1035019 (https://phabricator.wikimedia.org/T365204) (owner: Cathal Mooney)
[11:01:33] (Merged) jenkins-bot: Set DHCP relay for EVPN switches in codfw to 'forward-only' mode [homer/public] - https://gerrit.wikimedia.org/r/1035019 (https://phabricator.wikimedia.org/T365204) (owner: Cathal Mooney)
[11:04:05] (CR) Btullis: [C:+2] Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: Btullis)
[11:05:39] (Merged) jenkins-bot: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: Btullis)
[11:06:43] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version
[11:06:56] FIRING: [2x] ProbeDown: Service
gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:07:08] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
[11:07:37] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
[11:08:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364299)', diff saved to https://phabricator.wikimedia.org/P63079 and previous config saved to /var/cache/conftool/dbconfig/20240524-110802-marostegui.json
[11:08:07] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:09:30] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/geo-analytics: apply
[11:10:02] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/geo-analytics: apply
[11:10:23] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/geo-analytics: apply
[11:10:51] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/geo-analytics: apply
[11:11:56] RESOLVED: [2x] ProbeDown: Service gitlab2002:443 has failed probes (http_gitlab_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gitlab2002:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:13:32] SRE, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9829076 (akosiaris)
[11:14:22] SRE, MoveComms-Support, MW-on-K8s, serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9829077 (akosiaris)
[11:14:38] (PS6)
Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689)
[11:15:30] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply
[11:15:51] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply
[11:16:16] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[11:16:34] (CR) Hnowlan: [C:+1] "lgtm, thanks!" [puppet] - https://gerrit.wikimedia.org/r/1035351 (https://phabricator.wikimedia.org/T360778) (owner: Muehlenhoff)
[11:17:49] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/page-analytics: apply
[11:18:28] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply
[11:18:46] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply
[11:19:04] (PS7) Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689)
[11:19:15] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply
[11:20:44] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply
[11:20:53] PROBLEM - MariaDB Replica Lag: s6 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.21 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:21:00] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1035724
(https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[11:21:01] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply
[11:21:26] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/media-analytics: apply
[11:21:57] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/media-analytics: apply
[11:22:05] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/media-analytics: apply
[11:22:09] RECOVERY - MariaDB Replica Lag: s6 on clouddb1015 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:22:32] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/media-analytics: apply
[11:23:08] (CR) Vgutierrez: [V:+1] "ferm rules (from ncredir1001 PCC output) look sane:" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[11:23:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P63080 and previous config saved to /var/cache/conftool/dbconfig/20240524-112310-marostegui.json
[11:23:55] RECOVERY - MariaDB Replica Lag: s6 on clouddb1019 is OK: OK slave_sql_lag Replication lag: 0.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:24:33] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[11:24:59] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[11:28:57] (CR) Jelto: [C:+2] "thanks!
I added a runbook now :)" [alerts] - https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:28:59] (PS8) Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689)
[11:30:08] (Merged) jenkins-bot: sre: add alert for trusted gitlab-runner config [alerts] - https://gerrit.wikimedia.org/r/1035370 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:33:00] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[11:33:37] FIRING: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[11:35:46] * akosiaris looking ^
[11:38:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P63081 and previous config saved to /var/cache/conftool/dbconfig/20240524-113820-marostegui.json
[11:43:22] RESOLVED: SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost
[11:44:04] !log manually delete the 1 sessionstore pod running on parse1004
[11:44:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:46:20] (CR) Effie Mouzeli: [C:+1] "thank you very much for working on this, give me a shout on monday so I can be around when we deploy" [deployment-charts] - https://gerrit.wikimedia.org/r/1035743 (https://phabricator.wikimedia.org/T344324) (owner: Elukey)
[11:46:23] (CR) Ladsgroup: [C:+1] "It was a copy-paste error
from mailman2:" [puppet] - https://gerrit.wikimedia.org/r/1035740 (owner: EoghanGaffney)
[11:48:40] (PS1) Muehlenhoff: pws: Stop using deprecated .exists method (removed in 3.2) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/1035745
[11:49:05] (PS1) Jelto: gitlab: fix collaboration-services team name [alerts] - https://gerrit.wikimedia.org/r/1035746
[11:51:06] SRE, serviceops, Patch-For-Review: Modernise memcached systemd unit / sync, and make it presentable - https://phabricator.wikimedia.org/T273950#9829123 (jijiki) Open→In progress a:jijiki
[11:52:24] (PS1) Marostegui: db2150: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035747 (https://phabricator.wikimedia.org/T365797)
[11:53:01] (PS2) Jelto: gitlab: fix collaboration-services team name [alerts] - https://gerrit.wikimedia.org/r/1035746 (https://phabricator.wikimedia.org/T354656)
[11:53:06] (CR) Marostegui: [C:+2] db2150: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1035747 (https://phabricator.wikimedia.org/T365797) (owner: Marostegui)
[11:53:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364299)', diff saved to https://phabricator.wikimedia.org/P63082 and previous config saved to /var/cache/conftool/dbconfig/20240524-115328-marostegui.json
[11:53:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:53:33] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[11:53:43] ops-codfw, SRE, DBA, DC-Ops, Patch-For-Review: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9829143 (Marostegui) let's not replace the disk for now, the RAID got back to optimal. Let's give it the weekend and see what happens.
[11:53:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[11:53:47] (CR) EoghanGaffney: [C:+1] gitlab: fix collaboration-services team name [alerts] - https://gerrit.wikimedia.org/r/1035746 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:53:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T364299)', diff saved to https://phabricator.wikimedia.org/P63083 and previous config saved to /var/cache/conftool/dbconfig/20240524-115351-marostegui.json
[11:55:11] RECOVERY - MariaDB Replica Lag: s6 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:55:23] (CR) Jelto: [C:+2] gitlab: fix collaboration-services team name [alerts] - https://gerrit.wikimedia.org/r/1035746 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:55:29] RECOVERY - MegaRAID on db2150 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring
[11:55:57] ops-codfw, SRE, DBA, DC-Ops, Patch-For-Review: Degraded RAID on db2150 - https://phabricator.wikimedia.org/T365797#9829149 (Marostegui) ` [13:55:29] <+icinga-wm_> RECOVERY - MegaRAID on db2150 is OK: OK: optimal, 1 logical, 10 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/Me...
[11:56:33] (CR) EoghanGaffney: [C:+2] lists: Remove 'hasstatus => false' from mailman service [puppet] - https://gerrit.wikimedia.org/r/1035740 (owner: EoghanGaffney)
[11:56:37] (Merged) jenkins-bot: gitlab: fix collaboration-services team name [alerts] - https://gerrit.wikimedia.org/r/1035746 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[11:59:24] (PS1) NMW03: Set $wgCategoryCollation to uca-bs-u-kn on Bosnian Wikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/1034941 (https://phabricator.wikimedia.org/T365133)
[12:06:52] FIRING: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing
[12:07:19] ^ this is a test and expected
[12:07:39] RECOVERY - MariaDB Replica Lag: s1 on db2176 is OK: OK slave_sql_lag Replication lag: 52.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:08:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2116.codfw.wmnet onto db2176.codfw.wmnet
[12:11:52] RESOLVED: GitLabRunnerTrustedConfigMissing: Trusted gitlab-runner missing config - https://wikitech.wikimedia.org/wiki/GitLab/Runbook#GitLabRunnerTrustedConfigMissing - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabRunnerTrustedConfigMissing
[12:14:47] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9829202 (jon_amar-WMDE) Hi @Dzahn I'm not clear whether I can provide approval (I'm the Product Manager for Wikibase Suite) or whether doing so in a comment is sufficient. I...
[12:14:55] RECOVERY - MariaDB Replica Lag: s6 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.03 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:14:59] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [12:15:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: post clone (src) repool', diff saved to https://phabricator.wikimedia.org/P63084 and previous config saved to /var/cache/conftool/dbconfig/20240524-121523-arnaudb.json [12:15:33] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [12:15:41] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [12:16:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:16:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance [12:16:19] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [12:16:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'fix wrong weight', diff saved to https://phabricator.wikimedia.org/P63085 and previous config saved to /var/cache/conftool/dbconfig/20240524-121641-arnaudb.json [12:17:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 25%: post clone (src) repool', diff saved to https://phabricator.wikimedia.org/P63086 and previous config saved to /var/cache/conftool/dbconfig/20240524-121659-arnaudb.json [12:17:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 1%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63087 and previous config saved to /var/cache/conftool/dbconfig/20240524-121715-arnaudb.json [12:18:14] (03CR) 10TChin: [C:03+2] datasets-config: Add volume for configmap [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:19:12] (03Merged) 10jenkins-bot: datasets-config: Add volume for configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:19:36] (03PS1) 10Pmiazga: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) [12:19:45] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 4 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9829237 (10Gehel) [12:20:00] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [12:20:35] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [12:21:46] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9829264 (10Gehel) [12:22:50] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/edit-analytics: apply [12:23:09] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/edit-analytics: apply [12:23:17] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/edit-analytics: apply [12:23:39] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/edit-analytics: apply [12:24:57] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [12:25:07] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [12:27:29] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/editor-analytics: apply [12:27:43] (03PS1) 10Muehlenhoff: maps::tlsproxy: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1035750 [12:27:48] !log btullis@deploy1002 helmfile 
[codfw] DONE helmfile.d/services/editor-analytics: apply [12:28:00] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/editor-analytics: apply [12:28:21] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/editor-analytics: apply [12:28:32] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035750 (owner: 10Muehlenhoff) [12:30:43] (03PS1) 10TChin: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) [12:32:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 50%: post clone (src) repool', diff saved to https://phabricator.wikimedia.org/P63088 and previous config saved to /var/cache/conftool/dbconfig/20240524-123205-arnaudb.json [12:32:07] (03PS2) 10TChin: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) [12:32:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 2%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63089 and previous config saved to /var/cache/conftool/dbconfig/20240524-123221-arnaudb.json [12:32:42] (03PS1) 10Pmiazga: [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) [12:34:48] (03PS1) 10TChin: datasets-config: Bump helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035753 (https://phabricator.wikimedia.org/T357434) [12:36:09] (03PS3) 10TChin: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) [12:36:29] (03PS2) 10Muehlenhoff: openstack::base Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) [12:36:47] 
FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:36:52] (03PS30) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [12:37:59] (03PS4) 10TChin: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) [12:38:47] (03CR) 10Ottomata: "Thanks Timo!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [12:39:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364299)', diff saved to https://phabricator.wikimedia.org/P63090 and previous config saved to /var/cache/conftool/dbconfig/20240524-123925-marostegui.json [12:42:42] (03CR) 10Pmiazga: [C:04-1] "per Ariel point that I messed up test2.wikimedia and test2.wikipedia in host banes" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:43:47] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [12:43:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9829374 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye [12:45:02] (03PS2) 10Pmiazga: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) [12:45:38] (03CR) 10CI 
reject: [V:04-1] beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:46:27] (03CR) 10Ottomata: [C:03+1] datasets-config: Move prometheus port to right place (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:46:39] (03CR) 10Ottomata: [C:03+1] datasets-config: Bump helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035753 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:47:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 75%: post clone (src) repool', diff saved to https://phabricator.wikimedia.org/P63091 and previous config saved to /var/cache/conftool/dbconfig/20240524-124711-arnaudb.json [12:47:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 5%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63092 and previous config saved to /var/cache/conftool/dbconfig/20240524-124727-arnaudb.json [12:48:25] (03PS3) 10Pmiazga: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) [12:48:26] (03CR) 10Pmiazga: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:49:30] (03PS4) 10Pmiazga: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) [12:52:51] (03PS1) 10CDanis: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 (https://phabricator.wikimedia.org/T363407) [12:53:00] !log btullis@cumin1002 START - Cookbook sre.hosts.decommission 
for hosts snapshot1008.eqiad.wmnet [12:54:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P63093 and previous config saved to /var/cache/conftool/dbconfig/20240524-125433-marostegui.json [12:55:14] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] pws: Stop using deprecated .exists method (removed in 3.2) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035745 (owner: 10Muehlenhoff) [12:56:30] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9829434 (10Jclark-ctr) I was able to correct kafka-main1010 issue for dhcp but image fails still {F54287578} @akosiaris did you have this issue with other s... [12:57:46] (03PS1) 10Majavah: wikilabels::session: Set now-required memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1035762 [12:58:17] (03PS1) 10Btullis: Remove remaining references to snapshot1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) [12:58:19] (03PS1) 10Muehlenhoff: Remove obsolete Provides/Replaces/Conflicts for wmf-sre-laptop [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035764 [12:58:27] (03PS1) 10Slyngshede: Always require users to pick a system for SSH keys. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 [12:59:07] (03PS5) 10Elukey: amd-pytorch: refactor the common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 [12:59:27] (03PS1) 10Muehlenhoff: Remove obsolete Conflicts to pwstore [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035766 [13:00:09] (03PS1) 10Muehlenhoff: Unversion git-review dep [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035767 [13:00:10] (03PS1) 10Btullis: Remove snapshot1008 from the dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/1035768 (https://phabricator.wikimedia.org/T364455) [13:00:19] (03CR) 10Elukey: "== Step 0: scanning /home/elukey/Wikimedia/production-images/images/ ==" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 (owner: 10Elukey) [13:00:34] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete Provides/Replaces/Conflicts for wmf-sre-laptop [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035764 (owner: 10Muehlenhoff) [13:00:55] (03PS1) 10Alexandros Kosiaris: preseed: Add kafka-main to first time seeding [puppet] - 10https://gerrit.wikimedia.org/r/1035769 (https://phabricator.wikimedia.org/T363212) [13:01:18] (03CR) 10Elukey: [V:03+2 C:03+2] amd-pytorch: refactor the common bits to DRY the Dockerfiles [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1035441 (owner: 10Elukey) [13:01:34] (03CR) 10Alexandros Kosiaris: [V:03+2 C:03+2] preseed: Add kafka-main to first time seeding [puppet] - 10https://gerrit.wikimedia.org/r/1035769 (https://phabricator.wikimedia.org/T363212) (owner: 10Alexandros Kosiaris) [13:01:57] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete Conflicts to pwstore [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035766 (owner: 10Muehlenhoff) [13:02:07] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Unversion git-review dep [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035767 (owner: 
10Muehlenhoff) [13:02:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2116 (re)pooling @ 100%: post clone (src) repool', diff saved to https://phabricator.wikimedia.org/P63094 and previous config saved to /var/cache/conftool/dbconfig/20240524-130217-arnaudb.json [13:02:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 10%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63095 and previous config saved to /var/cache/conftool/dbconfig/20240524-130233-arnaudb.json [13:04:05] (03Abandoned) 10Bking: elasticsearch: enable CPU performance governor [puppet] - 10https://gerrit.wikimedia.org/r/1035534 (https://phabricator.wikimedia.org/T362922) (owner: 10Bking) [13:04:47] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops, 13Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9829487 (10akosiaris) >>! In T363212#9829434, @Jclark-ctr wrote: > I was able to correct kafka-main1010 issue for dhcp but image fails... 
[13:05:34] !log btullis@cumin1002 START - Cookbook sre.dns.netbox [13:07:47] (03CR) 10TChin: [C:03+2] datasets-config: Bump helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035753 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:08:18] (03PS1) 10Muehlenhoff: Add debian/copyright [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035770 [13:08:18] (03PS1) 10Muehlenhoff: Depend on sensible-utils (used by pws) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035771 [13:08:18] (03PS1) 10Muehlenhoff: Add dependency on python3 for wmf-update-ssh-config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035772 [13:08:37] (03Merged) 10jenkins-bot: datasets-config: Bump helmfile image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035753 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:09:20] (03Abandoned) 10CDanis: WIP: jaeger: include oauth config in Deployment [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005546 (https://phabricator.wikimedia.org/T358111) (owner: 10CDanis) [13:09:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P63096 and previous config saved to /var/cache/conftool/dbconfig/20240524-130942-marostegui.json [13:10:11] (03PS1) 10CDanis: Deploy otel collector to staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035773 (https://phabricator.wikimedia.org/T365809) [13:17:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63097 and previous config saved to /var/cache/conftool/dbconfig/20240524-131739-arnaudb.json [13:18:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1035765 (owner: 10Slyngshede) [13:19:19] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add debian/copyright [debs/wmf-sre-laptop] - 
10https://gerrit.wikimedia.org/r/1035770 (owner: 10Muehlenhoff) [13:19:31] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Depend on sensible-utils (used by pws) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035771 (owner: 10Muehlenhoff) [13:19:43] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add dependency on python3 for wmf-update-ssh-config [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035772 (owner: 10Muehlenhoff) [13:21:52] (03PS1) 10EoghanGaffney: lists: Don't try to remove the mtail user when monitoring is absent [puppet] - 10https://gerrit.wikimedia.org/r/1035777 (https://phabricator.wikimedia.org/T331706) [13:23:39] (03CR) 10Majavah: "see inline, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:24:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364299)', diff saved to https://phabricator.wikimedia.org/P63098 and previous config saved to /var/cache/conftool/dbconfig/20240524-132450-marostegui.json [13:24:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [13:24:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:24:58] (03PS5) 10TChin: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) [13:25:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1224.eqiad.wmnet with reason: Maintenance [13:25:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T364299)', diff saved to https://phabricator.wikimedia.org/P63099 and previous config saved to /var/cache/conftool/dbconfig/20240524-132514-marostegui.json [13:26:19] (03CR) 10TChin: [C:03+2] datasets-config: Move prometheus port to right place (031 comment) [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:26:50] (03PS1) 10Muehlenhoff: Add debian/source/format [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035778 [13:26:50] (03PS1) 10Muehlenhoff: Add ${misc:Depends} to dependencies [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035779 [13:27:14] (03Merged) 10jenkins-bot: datasets-config: Move prometheus port to right place [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035751 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:29:36] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add debian/source/format [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035778 (owner: 10Muehlenhoff) [13:29:43] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add ${misc:Depends} to dependencies [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035779 (owner: 10Muehlenhoff) [13:31:14] (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/1035780 [13:32:39] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:32:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63100 and previous config saved to /var/cache/conftool/dbconfig/20240524-133245-arnaudb.json [13:34:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035777 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [13:34:55] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [13:34:57] (03CR) 10EoghanGaffney: [C:03+2] lists: Don't try to remove the mtail user when monitoring is 
absent [puppet] - 10https://gerrit.wikimedia.org/r/1035777 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [13:35:48] (03PS61) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:35:48] (03CR) 10Muehlenhoff: "There's also hieradata/hosts/snapshot1008.yaml still" [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:36:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: snapshot1008.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - btullis@cumin1002" [13:36:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:36:03] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts snapshot1008.eqiad.wmnet [13:36:12] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9829671 (10cscott) I've created the following five feature requests as strawdog proposals to address each of... 
[13:36:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:36:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:37:47] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455#9829672 (10BTullis) a:05BTullis→03None [13:37:50] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455#9829676 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by btullis@cumin1002 for hosts: `sn... [13:37:50] (03CR) 10Filippo Giunchedi: [C:03+1] Deploy otel collector to staging clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035773 (https://phabricator.wikimedia.org/T365809) (owner: 10CDanis) [13:38:16] (03CR) 10DCausse: [C:03+1] Deprecate system::role for Blazegraph services [puppet] - 10https://gerrit.wikimedia.org/r/1035737 (owner: 10Muehlenhoff) [13:38:52] (03PS3) 10Muehlenhoff: openstack::base Add SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) [13:38:57] (03PS2) 10Btullis: Remove remaining references to snapshot1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) [13:39:04] (03CR) 10DCausse: wdqs: extract categories reload to its own cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [13:39:16] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:40:35] (03CR) 10Btullis: "Many thanks. Done." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:40:41] (03CR) 10Btullis: [C:03+2] Remove remaining references to snapshot1008 [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:41:23] (03CR) 10Muehlenhoff: openstack::base Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:41:47] (03CR) 10CI reject: [V:04-1] mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:43:08] (03CR) 10Btullis: [V:03+2 C:03+2] Remove snapshot1008 from the dumps scap targets [dumps/scap] - 10https://gerrit.wikimedia.org/r/1035768 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [13:45:16] 10ops-eqiad, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.05.27 - 2024.06.16), 13Patch-For-Review: decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455#9829692 (10BTullis) [13:46:03] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9829693 (10BTullis) [13:46:33] (03PS62) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:47:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 75%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63101 and previous config saved to /var/cache/conftool/dbconfig/20240524-134752-arnaudb.json [13:52:25] (03PS63) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 
(https://phabricator.wikimedia.org/T343674) [13:52:39] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [13:53:37] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9829717 (10stjn) All of those proposals are fundamentally worse (and probably will be more inaccessible) fro... [13:57:04] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76882704"]' 2>&1 | tee -a ~/T315510-enwiki-6; date [13:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:50] (03PS2) 10Hashar: wm-pcc: add a run action [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1032855 (https://phabricator.wikimedia.org/T363918) [13:57:52] (03PS2) 10Marostegui: mariadb: Promote db1192 to master [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) [13:57:53] (03CR) 10Hashar: [C:03+2] wm-pcc: add a run action [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1032855 (https://phabricator.wikimedia.org/T363918) (owner: 10Hashar) [13:58:38] (03CR) 10Majavah: [C:03+1] openstack::base Add SPDX headers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [13:58:45] (03Merged) 10jenkins-bot: wm-pcc: add a run action [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1032855 (https://phabricator.wikimedia.org/T363918) (owner: 10Hashar) [13:59:30] !log hashar@deploy1002 Started deploy [gerrit/gerrit@af1257f]: wm-pcc: add a run action - T363918 [13:59:34] T363918: Gerrit recheck button - 
https://phabricator.wikimedia.org/T363918 [13:59:37] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@af1257f]: wm-pcc: add a run action - T363918 (duration: 00m 07s) [14:02:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: post clone (dst) repool', diff saved to https://phabricator.wikimedia.org/P63102 and previous config saved to /var/cache/conftool/dbconfig/20240524-140258-arnaudb.json [14:06:05] (03CR) 10Btullis: [WIP] Initial import of ceph-csi-rbd chart for inspection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028931 (https://phabricator.wikimedia.org/T364472) (owner: 10Btullis) [14:06:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364299)', diff saved to https://phabricator.wikimedia.org/P63103 and previous config saved to /var/cache/conftool/dbconfig/20240524-140614-marostegui.json [14:06:20] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:10:01] (03CR) 10Marostegui: [C:04-2] "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035315 (https://phabricator.wikimedia.org/T364541) (owner: 10Marostegui) [14:14:26] (03PS1) 10EoghanGaffney: lists: Add lists2001/lists1004 as allowed hosts for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/1035785 (https://phabricator.wikimedia.org/T331706) [14:17:29] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1035785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:18:03] (03CR) 10EoghanGaffney: [C:03+2] lists: Add lists2001/lists1004 as allowed hosts for acmechief [puppet] - 10https://gerrit.wikimedia.org/r/1035785 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:19:31] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9829758 (10cscott) Please 
comment on the specific tasks, your concerns are addressed there. But also feel f... [14:21:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P63104 and previous config saved to /var/cache/conftool/dbconfig/20240524-142122-marostegui.json [14:27:56] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9829797 (10cmooney) 05Open→03Resolved Change has been pushed out in codfw where we have the issue. Closing this one for now... [14:36:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P63105 and previous config saved to /var/cache/conftool/dbconfig/20240524-143630-marostegui.json [14:36:47] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:47:27] (03CR) 10Herron: [C:03+2] trafficserver: point pyrra to thanos discovery record [puppet] - 10https://gerrit.wikimedia.org/r/1035541 (https://phabricator.wikimedia.org/T356386) (owner: 10Herron) [14:51:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364299)', diff saved to https://phabricator.wikimedia.org/P63106 and previous config saved to /var/cache/conftool/dbconfig/20240524-145139-marostegui.json [14:51:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:51:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:51:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on 
dbstore1009.eqiad.wmnet with reason: Maintenance [14:51:49] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [14:51:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [14:56:47] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:58:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:59:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:59:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63107 and previous config saved to /var/cache/conftool/dbconfig/20240524-145912-marostegui.json [14:59:18] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:59:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455#9829912 (10VRiley-WMF) a:03VRiley-WMF [15:01:10] (03CR) 10Pmiazga: [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [15:01:55] (03PS1) 10EoghanGaffney: lists: Don't include the lists::automation class in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 [15:13:58] (03PS1) 10Elukey: redfish: fix typo in DellSCP's class descr [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 [15:14:08] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 
1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1035789 (owner: 10EoghanGaffney) [15:14:43] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Shutdown of Puppet 5 servers - https://phabricator.wikimedia.org/T365798#9829957 (10MoritzMuehlenhoff) [15:14:44] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): Review/cleanup content of /srv/private/modules/secret/secrets/ssl in the private repo - https://phabricator.wikimedia.org/T364622#9829958 (10MoritzMuehlenhoff) [15:18:11] !log FINISHED lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76882704"]' 2>&1 | tee -a ~/T315510-enwiki-6; date # a few minutes ago [15:18:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:59] (03PS7) 10Scott French: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [15:20:02] (03CR) 10CI reject: [V:04-1] redfish: fix typo in DellSCP's class descr [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 (owner: 10Elukey) [15:24:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:24:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:25:24] (03CR) 10Scott French: "Now that Ben's https://gerrit.wikimedia.org/r/1033405 is merged, I've updated the new global-staging values to similarly use external-serv" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) (owner: 10Scott French) [15:27:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:28:20] (03PS2) 10Elukey: redfish: fix typo in DellSCP's class descr [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 [15:30:50] 06SRE, 10SRE-Access-Requests: Requesting access to for - https://phabricator.wikimedia.org/T365832 (10RAdimer-WMF) 03NEW [15:31:25] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9830041 (10RAdimer-WMF) [15:32:15] RESOLVED: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:35:00] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034943 [15:35:29] (03CR) 10Xcollazo: Remove remaining references to snapshot1008 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035763 (https://phabricator.wikimedia.org/T364455) (owner: 10Btullis) [15:39:39] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 10Data-Platform-SRE (2024.05.27 - 2024.06.16): decommission snapshot1008.eqiad.wmnet - https://phabricator.wikimedia.org/T364455#9830052 (10VRiley-WMF) 05Open→03Resolved [15:40:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:40:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:40:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on 
clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:41:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:41:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63109 and previous config saved to /var/cache/conftool/dbconfig/20240524-154108-marostegui.json [15:41:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [15:42:18] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9830059 (10elappen-WMF) Approving access from my end. [15:51:50] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [15:53:01] (03CR) 10BCornwall: [C:03+1] service: Remove probes for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/1035543 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [15:53:19] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830093 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [15:53:23] (03PS2) 10JHathaway: ganeti: function to generate ganeti known hosts [puppet] - 10https://gerrit.wikimedia.org/r/1023486 (https://phabricator.wikimedia.org/T309724) [15:53:31] (03CR) 10JHathaway: [C:03+2] "sounds good, merged" [puppet] - 10https://gerrit.wikimedia.org/r/1023486 (https://phabricator.wikimedia.org/T309724) (owner: 10JHathaway) [16:00:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:14:32] (03PS2) 10EoghanGaffney: lists: Don't include automation in standby hosts [puppet] - 10https://gerrit.wikimedia.org/r/1035789 [16:16:02] (03CR) 10BCornwall: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:17:35] PROBLEM - Disk space on backup1007 is CRITICAL: DISK CRITICAL - free space: /srv/objectstorage 6197226 MB (3% inode=96%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1007&var-datasource=eqiad+prometheus/ops [16:19:08] (03PS1) 10Dduvall: service: Remove blubberoid from service catalog and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1035797 (https://phabricator.wikimedia.org/T365742) [16:19:10] (03PS1) 10Dduvall: service: Remove remaining blubberoid related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1035798 (https://phabricator.wikimedia.org/T365742) [16:26:30] FIRING: ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:46] (03CR) 10ArielGlenn: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [16:30:09] (03CR) 10Reedy: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [16:31:30] RESOLVED: ProbeDown: Service wdqs1019:443 has failed probes 
(http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:32:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T364069)', diff saved to https://phabricator.wikimedia.org/P63111 and previous config saved to /var/cache/conftool/dbconfig/20240524-163245-marostegui.json [16:32:50] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:33:51] 06SRE, 10LDAP-Access-Requests: Grant Access to analytics-privatedata-users, wmf for Sonja Perry - https://phabricator.wikimedia.org/T365766#9830239 (10MMiller_WMF) Hello -- Sonja needs to be in the groups that give her access to Superset and Turnilo, but not more tools beyond that. Reading the page you linked... [16:35:56] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:36:15] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:36:27] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:40:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - 
https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:41:35] (03CR) 10BCornwall: [C:03+1] service: Remove blubberoid from backend servers and load balancers [puppet] - 10https://gerrit.wikimedia.org/r/1035589 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:43:08] (03CR) 10BCornwall: [C:03+1] service: Remove blubberoid from service catalog and conftool [puppet] - 10https://gerrit.wikimedia.org/r/1035797 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:43:52] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9830287 (10Soda) Going into the original question, is there a reason why the law book isn't broken up into p... [16:44:38] (03CR) 10BCornwall: [C:03+1] service: Remove remaining blubberoid related configuration [puppet] - 10https://gerrit.wikimedia.org/r/1035798 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [16:44:54] (03CR) 10Krinkle: beta: Introduce new test2wiki on test.wikipedia.beta.wmcloud.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035749 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [16:46:16] (03PS7) 10Reedy: interwiki.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) [16:46:16] (03PS1) 10Reedy: interwiki.php: Update per onwiki changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035802 [16:46:44] (03CR) 10Reedy: "New parent introduced to make this cleaner due to changes on metawiki page" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) (owner: 10Reedy) [16:47:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to 
https://phabricator.wikimedia.org/P63112 and previous config saved to /var/cache/conftool/dbconfig/20240524-164753-marostegui.json [16:51:04] (03CR) 10Volans: [C:03+1] "LGTM" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1035791 (owner: 10Elukey) [16:57:31] (03PS2) 10Reedy: interwiki(-labs)?.php: Update per onwiki changes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035802 [16:57:31] (03PS8) 10Reedy: interwiki.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035389 (https://phabricator.wikimedia.org/T365679) [16:57:45] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:57:52] (03CR) 10Reedy: "Dupe was removed on meta, so https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1035802/2?usp=related-change takes care of thi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 (owner: 10Reedy) [16:57:54] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830380 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:... 
[16:57:58] (03Abandoned) 10Reedy: interwiki-labs.php: Remove duplicates [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035421 (owner: 10Reedy) [16:58:22] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:58:36] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830383 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [16:59:08] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9830388 (10Soda) I will note **Exporting Issues** should be solved by the Wikisource extension which adds a... [17:02:32] (03PS2) 10Bking: dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) [17:03:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P63113 and previous config saved to /var/cache/conftool/dbconfig/20240524-170301-marostegui.json [17:05:04] (03CR) 10Pintoch: [C:03+1] "The `redis.conf` looks good to me. Snapshotting is disabled and I think that makes sense for a lot of applications. Thanks for working on " [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1012797 (https://phabricator.wikimedia.org/T360378) (owner: 10BryanDavis) [17:27:52] (03PS3) 10CDanis: WIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 (https://phabricator.wikimedia.org/T363407) [17:30:11] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:30:21] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:34:58] !log andrew@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:35:05] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830591 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:37:43] FIRING: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:39:03] (03CR) 10Volans: [C:03+1] "LGTM, thanks for the fixes! Reply inline for the choice of the stat host." [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [17:40:18] (03PS1) 10Pppery: Add link for Arcanist message documentation [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1035805 (https://phabricator.wikimedia.org/T351581) [17:41:30] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830617 (10Andrew) The main point of suspicion here is that it doesn't autoconfig the network but asks me to specify a netmask. Cathal suggests that switching the nic fir... [17:41:34] (03CR) 10Volans: [C:03+1] "LGTM (as before, leaving the fine-details to your team). Same for the next CR in the chain." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [17:41:44] !log andrew@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host cloudvirt1041.eqiad.wmnet with OS bookworm [17:41:52] 10ops-eqiad, 06SRE, 06cloud-services-team, 06DC-Ops: cloudvirt1041: can't boot after reimage - https://phabricator.wikimedia.org/T364984#9830621 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin1002 for host cloudvirt1041.eqiad.wmnet with OS bookworm executed with errors:... [17:44:56] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:45:05] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [17:45:40] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [17:45:45] 06SRE, 10SRE-tools, 07SRE-Unowned, 06Infrastructure-Foundations: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9830663 (10Volans) Is for anyone that wants to write this cookbook. [17:47:43] RESOLVED: OtelCollectorRefusedSpans: Some spans have been refused by receiver otlp on k8s - TODO - https://grafana.wikimedia.org/d/SPebYW7Iz/opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DOtelCollectorRefusedSpans [17:50:01] FIRING: [2x] CertAlmostExpired: Certificate for service ml-staging-ctrl2002:6443 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#ml-staging-ctrl2002:6443 - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:50:31] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[17:54:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63116 and previous config saved to /var/cache/conftool/dbconfig/20240524-175421-marostegui.json [17:54:26] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:05:01] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [18:09:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P63117 and previous config saved to /var/cache/conftool/dbconfig/20240524-180929-marostegui.json [18:12:20] (03CR) 10BCornwall: [C:03+2] service: Remove probes for blubberoid [puppet] - 10https://gerrit.wikimedia.org/r/1035543 (https://phabricator.wikimedia.org/T365742) (owner: 10Dduvall) [18:19:31] (03PS4) 10CDanis: otelcol: deploy k8s attributes processor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 (https://phabricator.wikimedia.org/T363407) [18:21:58] (03PS5) 10CDanis: otelcol: deploy k8sattributes processor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 (https://phabricator.wikimedia.org/T363407) [18:24:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P63118 and previous config saved to /var/cache/conftool/dbconfig/20240524-182437-marostegui.json [18:28:17] (03CR) 10CDanis: [C:03+2] "Very happy to open followup patches for any comments or concerns" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [18:31:00] (03Merged) 10jenkins-bot: otelcol: deploy k8sattributes processor [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035756 
(https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [18:32:12] (03PS2) 10CDobbins: purged: set use_pki to true for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) [18:32:55] (03PS3) 10CDobbins: purged: set use_pki to true for drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) [18:33:41] 06SRE, 10MoveComms-Support, 10MW-on-K8s, 06serviceops, and 2 others: Move 100% of external traffic to Kubernetes (excluding Votewiki) - https://phabricator.wikimedia.org/T362323#9830829 (10Jdforrester-WMF) Note that the votewiki blocker is apparently also now fixed. [18:33:54] (03CR) 10CDobbins: purged: set use_pki to true for drmrs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:35:14] (03CR) 10Ssingh: purged: set use_pki to true for drmrs (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [18:36:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1007-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [18:39:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T364299)', diff saved to https://phabricator.wikimedia.org/P63120 and previous config saved to /var/cache/conftool/dbconfig/20240524-183945-marostegui.json [18:39:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1162.eqiad.wmnet with reason: Maintenance [18:39:51] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:40:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db1162.eqiad.wmnet with reason: Maintenance [18:40:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T364299)', diff saved to https://phabricator.wikimedia.org/P63121 and previous config saved to /var/cache/conftool/dbconfig/20240524-184009-marostegui.json [18:45:00] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [18:45:09] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [18:56:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:04:26] (03PS1) 10CDanis: otelcol: use k8s attributes for service names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035811 (https://phabricator.wikimedia.org/T363407) [19:07:18] (03CR) 10CDanis: [C:03+2] "As before happy to take on followups, but this is deployed and working" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035811 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:09:37] (03Merged) 10jenkins-bot: otelcol: use k8s attributes for service names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035811 (https://phabricator.wikimedia.org/T363407) (owner: 10CDanis) [19:14:18] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9831069 (10VRiley-WMF) Dell sent the motherboard out for kafka-main1009. After replacing the motherboard, it still continues to throw the same error as it did... [19:19:24] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:24:13] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:26:02] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [19:26:20] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [19:29:59] PROBLEM - Disk space on stat1008 is CRITICAL: DISK CRITICAL - free space: /srv 250730 MB (3% inode=90%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=stat1008&var-datasource=eqiad+prometheus/ops [19:31:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:32:00] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [19:32:10] !log tchin@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [19:44:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364069)', diff saved to https://phabricator.wikimedia.org/P63122 and previous config saved to /var/cache/conftool/dbconfig/20240524-194450-marostegui.json [19:44:55] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:46:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T364299)', diff saved to https://phabricator.wikimedia.org/P63123 and previous config saved to /var/cache/conftool/dbconfig/20240524-194655-marostegui.json [19:47:03] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:50:59] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:51:51] RECOVERY - mailman list info on 
lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.267 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:59:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P63124 and previous config saved to /var/cache/conftool/dbconfig/20240524-195958-marostegui.json [20:02:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P63125 and previous config saved to /var/cache/conftool/dbconfig/20240524-200203-marostegui.json [20:04:08] (03CR) 10Krinkle: [C:03+1] [beta] Add test2.wikimedia.beta.wmcloud.org to beta_sites [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [20:04:51] (03CR) 10Krinkle: [C:03+1] "Remember to ask for SRE support to merge in this repo (I/we can't)." [puppet] - 10https://gerrit.wikimedia.org/r/1035752 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [20:15:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P63126 and previous config saved to /var/cache/conftool/dbconfig/20240524-201506-marostegui.json [20:17:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162', diff saved to https://phabricator.wikimedia.org/P63127 and previous config saved to /var/cache/conftool/dbconfig/20240524-201711-marostegui.json [20:27:30] (03PS1) 10Brennen Bearnes: WIP: single user for gitlab-settings; timer for configure-projects [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) [20:30:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T364069)', diff saved to https://phabricator.wikimedia.org/P63128 and previous config saved to /var/cache/conftool/dbconfig/20240524-203014-marostegui.json 
[20:30:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[20:30:19] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[20:30:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[20:30:33] (PS4) CDobbins: purged: set use_pki to true for drmrs [puppet] - https://gerrit.wikimedia.org/r/1035538 (https://phabricator.wikimedia.org/T360506)
[20:30:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T364069)', diff saved to https://phabricator.wikimedia.org/P63129 and previous config saved to /var/cache/conftool/dbconfig/20240524-203037-marostegui.json
[20:31:56] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:32:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1162 (T364299)', diff saved to https://phabricator.wikimedia.org/P63130 and previous config saved to /var/cache/conftool/dbconfig/20240524-203219-marostegui.json
[20:32:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[20:32:24] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[20:32:33] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:32:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[20:32:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63131 and previous config saved to /var/cache/conftool/dbconfig/20240524-203243-marostegui.json
[20:36:08] (PS1) Pppery: Ignore /src/.cache as well [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1035822
[20:36:47] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[20:36:48] (CR) Dr0ptp4kt: "Generally looks okay - understood you'll want to step through it and debug any steps with someone with ops permissions." [cookbooks] - https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: DCausse)
[20:47:22] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:47:56] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:52:44] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:53:15] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:53:29] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[20:54:01] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[21:26:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:33:06] (PS3) Bking: dse-k8s: add airflow-analytics-test namespace [deployment-charts] - https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001)
[21:39:19] (CR) Bking: dse-k8s: add airflow-analytics-test namespace (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) (owner: Bking)
[22:05:16] FIRING: [4x] CertAlmostExpired: Certificate for service ml-staging-ctrl2001:6443 is about to expire - https://grafana.wikimedia.org/d/K1dRhGCnz/probes-tls-dashboard - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[22:21:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:24:23] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:24:30] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:26:39] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:33:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:34:23] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[22:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63132 and previous config saved to /var/cache/conftool/dbconfig/20240524-223921-marostegui.json
[22:39:26] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[22:54:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P63133 and previous config saved to /var/cache/conftool/dbconfig/20240524-225428-marostegui.json
[22:58:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364069)', diff saved to https://phabricator.wikimedia.org/P63134 and previous config saved to /var/cache/conftool/dbconfig/20240524-225846-marostegui.json
[22:58:51] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[23:09:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P63135 and previous config saved to /var/cache/conftool/dbconfig/20240524-230937-marostegui.json
[23:13:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63136 and previous config saved to /var/cache/conftool/dbconfig/20240524-231354-marostegui.json
[23:16:45] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:16:50] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:24:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T364299)', diff saved to https://phabricator.wikimedia.org/P63137 and previous config saved to /var/cache/conftool/dbconfig/20240524-232445-marostegui.json
[23:24:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[23:24:50] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[23:25:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[23:25:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T364299)', diff saved to https://phabricator.wikimedia.org/P63138 and previous config saved to /var/cache/conftool/dbconfig/20240524-232508-marostegui.json
[23:29:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P63139 and previous config saved to /var/cache/conftool/dbconfig/20240524-232902-marostegui.json
[23:31:13] !log ebernhardson@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[23:31:20] !log ebernhardson@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[23:38:31] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1034944
[23:38:31] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1034944 (owner: TrainBranchBot)
[23:44:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T364069)', diff saved to https://phabricator.wikimedia.org/P63140 and previous config saved to /var/cache/conftool/dbconfig/20240524-234410-marostegui.json
[23:44:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[23:44:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[23:44:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[23:44:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T364069)', diff saved to https://phabricator.wikimedia.org/P63141 and previous config saved to /var/cache/conftool/dbconfig/20240524-234433-marostegui.json