[00:02:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint2001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[00:10:47] Hello, CI is stuck again.
[00:11:01] coverage, codehealth and others.
[00:17:35] ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T325652 (phaultfinder)
[00:24:07] (CR) Thcipriani: [C: +1] admin: create new group deployment-jenkins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn)
[00:27:18] Kizule: I definitely see zuul enqueuing jobs. The code coverage and codehealth pipelines have a much lower priority than the test and merger pipelines (since they all rely on the same pool of runners). Those pipelines get backed up when there are a lot of tests running for other pipelines.
[00:27:47] tl;dr: working, slowly :(
[00:31:20] Kizule: https://integration.wikimedia.org/ci/ is a good place to look when https://integration.wikimedia.org/zuul/ seems "stuck". If the integration-agent-docker-* instances are mostly full of active jobs things are probably working even if a bit overloaded.
[00:33:44] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:58] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f6b17bbf280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:33:58] org/wiki/Search%23Administration
[00:35:18] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:35:32] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1502, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:35:32] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:34] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:40:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[00:40:54] thcipriani, bd808: Yeah, I understand. Thanks for making me learn something now. And I'm sorry for bothering you, I just have good intentions.
[00:41:18] Kizule: no bother, thanks for caring :)
[00:41:39] ^ +1 to that.
[00:42:37] That's what I actually mean. You're more than welcome. :)
[00:42:50] RECOVERY - Check systemd state on logstash2026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:37:46] (JobUnavailable) firing: (2) Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:42:46] (JobUnavailable) firing: (8) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:48:36] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (phaultfinder)
[01:57:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable -
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:17:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:46] (JobUnavailable) resolved: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:26:38] RECOVERY - Work requests waiting in Zuul Gearman server on contint2001 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:46:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[03:48:37] ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T319126 (phaultfinder)
[04:17:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:31:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:57:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:02:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:42:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:47:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:36:39] (PS1) KartikMistry:
ContentTranslation: Increase MT threshold for publishing in cswiki by 20% [mediawiki-config] - https://gerrit.wikimedia.org/r/875192 (https://phabricator.wikimedia.org/T324721)
[06:58:55] (PS1) Marostegui: Revert "db2131: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/874876
[06:59:13] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 1%: After testing', diff saved to https://phabricator.wikimedia.org/P42749 and previous config saved to /var/cache/conftool/dbconfig/20230104-065912-root.json
[06:59:52] (CR) Marostegui: [C: +2] Revert "db2131: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/874876 (owner: Marostegui)
[07:00:04] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T0700)
[07:08:49] (PS1) Muehlenhoff: Add missing account dates/contacts for two accounts [puppet] - https://gerrit.wikimedia.org/r/875193
[07:08:51] <_joe_> helloo
[07:09:17] (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add connection timeout to sendmail [deployment-charts] - https://gerrit.wikimedia.org/r/869755 (https://phabricator.wikimedia.org/T325131) (owner: Giuseppe Lavagetto)
[07:10:14] (CR) Muehlenhoff: [C: +2] Add missing account dates/contacts for two accounts [puppet] - https://gerrit.wikimedia.org/r/875193 (owner: Muehlenhoff)
[07:14:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 5%: After testing', diff saved to https://phabricator.wikimedia.org/P42750 and previous config saved to /var/cache/conftool/dbconfig/20230104-071417-root.json
[07:15:36] (Merged) jenkins-bot: mediawiki: add connection timeout to sendmail [deployment-charts] - https://gerrit.wikimedia.org/r/869755 (https://phabricator.wikimedia.org/T325131) (owner: Giuseppe Lavagetto)
[07:17:48] (PS2) Giuseppe Lavagetto: mediawiki: allow rsyslog to process the apache logs [deployment-charts] -
https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876)
[07:19:11] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[07:19:19] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[07:20:03] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[07:20:08] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[07:20:34] SRE, Infrastructure-Foundations: Integrate Buster 10.13 point update - https://phabricator.wikimedia.org/T317413 (MoritzMuehlenhoff)
[07:24:27] (PS1) Muehlenhoff: Remove LDAP access for relgu [puppet] - https://gerrit.wikimedia.org/r/875197
[07:27:49] (CR) Muehlenhoff: [C: +2] Remove LDAP access for relgu [puppet] - https://gerrit.wikimedia.org/r/875197 (owner: Muehlenhoff)
[07:29:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 10%: After testing', diff saved to https://phabricator.wikimedia.org/P42751 and previous config saved to /var/cache/conftool/dbconfig/20230104-072922-root.json
[07:35:04] !log dbmaint eqiad deploy schema change on x1 T255174
[07:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:08] T255174: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174
[07:35:22] (PS1) Muehlenhoff: Remove LDAP access for emedina [puppet] - https://gerrit.wikimedia.org/r/875256
[07:35:24] !log dbmaint codfw deploy schema change on x1 T255174
[07:35:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:35:37] (PS1) Marostegui: Revert "mariadb: Change x1 to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/874877
[07:36:38] (CR) Marostegui: [C: +2] Revert "mariadb: Change x1 to STATEMENT" [puppet] - https://gerrit.wikimedia.org/r/874877 (owner: Marostegui)
[07:38:08] !log Switch x1 back to RBR T255174
[07:38:10] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:38:10] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[07:38:19] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[07:38:31] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[07:38:37] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[07:39:24] (CR) Muehlenhoff: [C: +2] Remove LDAP access for emedina [puppet] - https://gerrit.wikimedia.org/r/875256 (owner: Muehlenhoff)
[07:44:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 25%: After testing', diff saved to https://phabricator.wikimedia.org/P42752 and previous config saved to /var/cache/conftool/dbconfig/20230104-074427-root.json
[07:49:11] (CR) Muehlenhoff: [C: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/874809 (owner: Volans)
[07:51:47] (CR) Muehlenhoff: [C: +1] "All existing accounts have been fixed, ship it :-)" [puppet] - https://gerrit.wikimedia.org/r/874809 (owner: Volans)
[07:54:51] SRE, ops-drmrs, Infrastructure-Foundations, netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (ayounsi) @RobH I don't think there is a need to depool the site as optic are hot-swappable the risk of killing the router is quite low (unless they start hammering at th...
[07:59:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 50%: After testing', diff saved to https://phabricator.wikimedia.org/P42753 and previous config saved to /var/cache/conftool/dbconfig/20230104-075932-root.json
[08:00:05] Amir1 and Urbanecm: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T0800).
[08:00:05] matthiasmullie: A patch you scheduled for UTC morning backport window is about to be deployed.
Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:37] o/
[08:03:29] I'll self-service my backports
[08:03:59] (PS5) Matthias Mullie: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667)
[08:04:15] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) (owner: Matthias Mullie)
[08:05:00] (Merged) jenkins-bot: [SearchVue] Enable on ruwiki (beta) [mediawiki-config] - https://gerrit.wikimedia.org/r/849025 (https://phabricator.wikimedia.org/T311667) (owner: Matthias Mullie)
[08:07:12] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [skins/MinervaNeue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874890 (https://phabricator.wikimedia.org/T321377) (owner: Matthias Mullie)
[08:14:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 75%: After testing', diff saved to https://phabricator.wikimedia.org/P42754 and previous config saved to /var/cache/conftool/dbconfig/20230104-081437-root.json
[08:20:51] !log dbmaint codfw deploy schema change on s8 T255174
[08:20:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:20:54] T255174: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174
[08:22:08] !log dbmaint eqiad deploy schema change on s8 T255174
[08:22:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:25] (Merged) jenkins-bot: Always show search results at full width [skins/MinervaNeue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874890 (https://phabricator.wikimedia.org/T321377) (owner: Matthias Mullie)
[08:23:50] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:874890|Always show search
results at full width (T321377)]]
[08:23:53] T321377: [S] Figure out how to show SearchVue on small screens in non-mobile skins - https://phabricator.wikimedia.org/T321377
[08:25:42] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:874890|Always show search results at full width (T321377)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[08:26:28] !log dbmaint eqiad deploy schema change on s4 T255174
[08:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:30] T255174: Extend echo_unread_wikis.euw_wiki - https://phabricator.wikimedia.org/T255174
[08:26:31] !log dbmaint codfw deploy schema change on s4 T255174
[08:26:31] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:49] !log dbmaint codfw deploy schema change on s4 T326011
[08:26:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:51] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011
[08:26:52] !log dbmaint eqiad deploy schema change on s4 T326011
[08:26:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:55] !log dbmaint eqiad deploy schema change on s8 T326011
[08:26:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:58] !log dbmaint codfw deploy schema change on s8 T326011
[08:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:29:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2131 (re)pooling @ 100%: After testing', diff saved to https://phabricator.wikimedia.org/P42755 and previous config saved to /var/cache/conftool/dbconfig/20230104-082942-root.json
[08:31:01]
(CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:31:17] PROBLEM - Check systemd state on puppetdb1003 is CRITICAL: CRITICAL - degraded: The following units failed: uwsgi.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:32:12] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:874890|Always show search results at full width (T321377)]] (duration: 08m 22s)
[08:32:15] T321377: [S] Figure out how to show SearchVue on small screens in non-mobile skins - https://phabricator.wikimedia.org/T321377
[08:32:25] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874889 (https://phabricator.wikimedia.org/T321377) (owner: Matthias Mullie)
[08:40:49] (CR) Ayounsi: [C: +2] Add Peering News to Puppet [puppet] - https://gerrit.wikimedia.org/r/849114 (owner: Ayounsi)
[08:41:59] (CR) Muehlenhoff: [C: +2] Mask uwsgi on puppetdb hosts [puppet] - https://gerrit.wikimedia.org/r/874852 (https://phabricator.wikimedia.org/T321783) (owner: Muehlenhoff)
[08:47:48] (Merged) jenkins-bot: Change IW breakpoint to be enabled on smaller screen [core] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874889 (https://phabricator.wikimedia.org/T321377) (owner: Matthias Mullie)
[08:48:11] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:874889|Change IW breakpoint to be enabled on smaller screen (T321377)]]
[08:48:14] T321377: [S] Figure out how to show SearchVue on small screens in non-mobile skins - https://phabricator.wikimedia.org/T321377
[08:48:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host
puppetdb1003.eqiad.wmnet
[08:48:53] !log jelto@cumin1001 START - Cookbook sre.gitlab.reboot-runner rolling reboot on A:gitlab-runner
[08:50:01] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:874889|Change IW breakpoint to be enabled on smaller screen (T321377)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet
[08:53:11] RECOVERY - DPKG on puppetdb1003 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg
[08:53:23] RECOVERY - Check systemd state on puppetdb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:57:07] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:874889|Change IW breakpoint to be enabled on smaller screen (T321377)]] (duration: 08m 56s)
[08:57:10] T321377: [S] Figure out how to show SearchVue on small screens in non-mobile skins - https://phabricator.wikimedia.org/T321377
[08:57:19] (CR) TrainBranchBot: [C: +2] "Approved by mlitn@deploy1002 using scap backport" [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874887 (owner: Matthias Mullie)
[08:59:08] (Merged) jenkins-bot: Squashed diff to catch up to wmf/1.40.0-wmf.17 [extensions/SearchVue] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/874887 (owner: Matthias Mullie)
[08:59:33] !log mlitn@deploy1002 Started scap: Backport for [[gerrit:874887|Squashed diff to catch up to wmf/1.40.0-wmf.17]]
[08:59:46] (CR) JMeybohm: blackbox::check::http: change expiry check value from days to seconds (1 comment) [puppet] - https://gerrit.wikimedia.org/r/866594 (owner: Jbond)
[09:00:05] dduvall and hashar: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T0900).
[09:00:05] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host puppetdb1003.eqiad.wmnet
[09:00:28] train will be run tonight by Dan
[09:01:21] !log mlitn@deploy1002 mlitn and mlitn: Backport for [[gerrit:874887|Squashed diff to catch up to wmf/1.40.0-wmf.17]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet
[09:05:51] (CR) JMeybohm: [C: -1] service::catalog: Add aux-k8s-ingress (1 comment) [puppet] - https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: Clément Goubert)
[09:06:21] (PS1) Ayounsi: Peering news: minor improvments and bug fixes [puppet] - https://gerrit.wikimedia.org/r/875259
[09:07:46] !log mlitn@deploy1002 Finished scap: Backport for [[gerrit:874887|Squashed diff to catch up to wmf/1.40.0-wmf.17]] (duration: 08m 13s)
[09:08:00] !log UTC morning backports done
[09:08:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:08:35] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[09:11:16] (PS2) Ayounsi: Peering news: minor improvments and bug fixes [puppet] - https://gerrit.wikimedia.org/r/875259
[09:11:58] (PS11) FNegri: Make sure cloud_cumin public key is evaluated [puppet] - https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483)
[09:13:19] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[09:15:25] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 120 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:17:01] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[09:18:43] (CR) David Caro: tools-webservice: read buildservice_repository from config (1 comment) [software/tools-webservice] - https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: Raymond Ndibe)
[09:18:47] (PS3) Ayounsi: Peering news: minor improvments and bug fixes [puppet] - https://gerrit.wikimedia.org/r/875259
[09:18:52] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: FNegri)
[09:20:21] (CR) CI reject: [V: -1] Peering news: minor improvments and bug fixes [puppet] - https://gerrit.wikimedia.org/r/875259 (owner: Ayounsi)
[09:23:08] jouncebot: nowandnext
[09:23:09] For the next 1 hour(s) and 36 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T0900)
[09:23:09] In 1 hour(s) and 36 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1100)
[09:26:08] (PS4) Ayounsi: Peering news:
minor improvments and bug fixes [puppet] - https://gerrit.wikimedia.org/r/875259
[09:29:46] !log jelto@cumin1001 END (PASS) - Cookbook sre.gitlab.reboot-runner (exit_code=0) rolling reboot on A:gitlab-runner
[09:33:03] (PS6) Filippo Giunchedi: monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[09:33:05] (PS1) Filippo Giunchedi: prometheus: bump scrape timeout for webperf [puppet] - https://gerrit.wikimedia.org/r/875264 (https://phabricator.wikimedia.org/T326118)
[09:34:12] (CR) Ayounsi: [C: +2] "self merging as minor changes and tested locally" [puppet] - https://gerrit.wikimedia.org/r/875259 (owner: Ayounsi)
[09:34:21] (PS1) Hashar: phabricator: dedupe phd user creation [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146)
[09:34:23] (PS1) Hashar: phabricator: change phd home dir to /var/lib/phd [puppet] - https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146)
[09:34:53] (CR) Hashar: [C: -1] "/var/run/phpd is used as the phd user home directory. On boot, the directory does not exist since it is under /run."
[puppet] - https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: Dzahn)
[09:36:13] (PS7) Filippo Giunchedi: monitoring: update monitoring files to dynamically discover config [puppet] - https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[09:36:43] (CR) CI reject: [V: -1] phabricator: dedupe phd user creation [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar)
[09:37:38] (CR) Filippo Giunchedi: "Only a bandaid really to get more data out, should help a little though" [puppet] - https://gerrit.wikimedia.org/r/875264 (https://phabricator.wikimedia.org/T326118) (owner: Filippo Giunchedi)
[09:37:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetdb2003.codfw.wmnet
[09:37:49] !log dbmaint codfw deploy schema change on s5 T326011
[09:37:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:52] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011
[09:40:39] I'm going to run reboots on mwdebug/mw hosts, starting in about 10 minutes if nobody objects.
[09:43:01] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:44:24] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 56286
[09:45:35] (CR) Filippo Giunchedi: [C: +1] "I accidentally an old PS, restored it now and change LGTM" [puppet] - https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: Jbond)
[09:45:36] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 56286
[09:47:39] (CR) Ayounsi: [C: +2] drmrs: offload traffic from Tata [homer/public] - https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[09:47:46] (JobUnavailable) firing: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[09:47:47] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:48:18] !log drmrs: offload traffic from Tata - T324955
[09:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:21] (CR) Filippo Giunchedi: [C: +2] services: remove old graphite hosts [deployment-charts] - https://gerrit.wikimedia.org/r/859579 (https://phabricator.wikimedia.org/T315524) (owner: Filippo Giunchedi)
[09:50:03] (Merged) jenkins-bot: drmrs: offload traffic from Tata [homer/public] - https://gerrit.wikimedia.org/r/867128 (https://phabricator.wikimedia.org/T324955) (owner: Ayounsi)
[09:53:12] !log Upload imposm3_0.11.1-1 to buster-wikimedia - T325293
[09:53:14] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:15] T325293: OSM import fails on both eqiad/codfw because of wrong data input - https://phabricator.wikimedia.org/T325293
[09:57:40] (CR) Phedenskog: [C: +1] "Great :)" [puppet] - https://gerrit.wikimedia.org/r/875264 (https://phabricator.wikimedia.org/T326118) (owner: Filippo Giunchedi)
[09:59:13] PROBLEM - Check systemd state on puppetdb2003 is CRITICAL: CRITICAL - degraded: The following units failed: postgresql@15-main.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:00:07] (PS3) Volans: admin: Add eileen to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/870705 (https://phabricator.wikimedia.org/T325608) (owner: BCornwall)
[10:00:30] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (Volans)
[10:00:34] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[10:00:38] (CR) Muehlenhoff: [C: +2] Fix up comment wrt use of restrict for WMCS [puppet] - https://gerrit.wikimedia.org/r/870824 (owner: Muehlenhoff)
[10:01:00] hashar: train-presync.service is failed on deploy1002 because 1 hosts had sync_wikiversions errors. It's parse1002, probably because it ran before it was put in inactive. I'll clear the failure.
[10:01:09] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [10:01:10] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [10:01:21] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:01:26] (03PS2) 10Hashar: phabricator: dedupe phd user creation [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) [10:01:28] (03PS2) 10Hashar: phabricator: change phd home dir to /var/lib/phd [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) [10:01:38] * volans looking at puppetdb2003 [10:02:12] moritzm: it seems postgres is not happy after the reboot [10:02:50] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [10:02:51] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [10:02:59] yeah, currently trying to figure out why... 
[10:03:02] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [10:03:03] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [10:03:10] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [10:03:11] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:03:19] (03CR) 10Volans: [C: 03+1] Revert "rake - spdx: also check hiera files" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond) [10:03:21] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:03:22] !log filippo@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:03:29] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:03:34] !log filippo@deploy1002 helmfile [codfw] [main] START helmfile.d/services/mw-jobrunner : sync [10:03:34] !log filippo@deploy1002 helmfile [codfw] [canary] START helmfile.d/services/mw-jobrunner : sync [10:03:40] !log filippo@deploy1002 helmfile [codfw] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:03:40] !log filippo@deploy1002 helmfile [codfw] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:03:44] !log filippo@deploy1002 helmfile [eqiad] [canary] START helmfile.d/services/mw-jobrunner : sync [10:03:44] !log filippo@deploy1002 helmfile [eqiad] [main] START helmfile.d/services/mw-jobrunner : sync [10:03:48] !log filippo@deploy1002 helmfile [eqiad] [canary] DONE helmfile.d/services/mw-jobrunner : sync [10:03:50] !log filippo@deploy1002 helmfile [eqiad] [main] DONE helmfile.d/services/mw-jobrunner : sync [10:03:51] !log filippo@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [10:03:53] sorry for the spam ^ applying ACL changes [10:03:57] moritzm: /srv/postgres/15/main is empty [10:04:02] !log filippo@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [10:04:03] !log filippo@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-web: apply [10:04:15] !log filippo@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [10:04:24] moritzm: it seems to me that the db is in /var/lib/postgresql/15/main [10:04:47] !log Rolling reboot of mwdebug hosts in codfw [10:04:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:48] !log dbmaint eqiad deploy schema change on s5 T326011 [10:04:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:04:51] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [10:04:59] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:06:37] (03CR) 10Volans: [C: 03+2] "LGTM, approved on task." [puppet] - 10https://gerrit.wikimedia.org/r/870705 (https://phabricator.wikimedia.org/T325608) (owner: 10BCornwall) [10:07:07] volans: I think I found the issue, the Hiera config for the new bookworm pair is broken, making a patch [10:07:16] ack [10:07:33] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: bump scrape timeout for webperf [puppet] - 10https://gerrit.wikimedia.org/r/875264 (https://phabricator.wikimedia.org/T326118) (owner: 10Filippo Giunchedi) [10:07:39] (03PS2) 10Filippo Giunchedi: prometheus: bump scrape timeout for webperf [puppet] - 10https://gerrit.wikimedia.org/r/875264 (https://phabricator.wikimedia.org/T326118) [10:08:48] (03CR) 10Jbond: [C: 03+2] Revert "rake - spdx: also check hiera files" [puppet] - 10https://gerrit.wikimedia.org/r/869801 (owner: 10Jbond) [10:12:09] (03Abandoned) 10Effie Mouzeli: Switch Thumbor haproxy load balancing to IP hash [puppet] - 10https://gerrit.wikimedia.org/r/636024 (https://phabricator.wikimedia.org/T266155) (owner: 10Gilles) [10:13:44] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:14:46] !log Rolling reboot of mwdebug hosts in eqiad [10:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:54] 
!log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:15:43] (03PS1) 10Jbond: hieradata: drop SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/875270 [10:16:57] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/870813 (owner: 10Muehlenhoff) [10:17:46] (JobUnavailable) resolved: Reduced availability for job webperf_navtiming in ext@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:17:47] (03PS1) 10Muehlenhoff: Properly name replication records for second puppetdb pair [puppet] - 10https://gerrit.wikimedia.org/r/875272 (https://phabricator.wikimedia.org/T321783) [10:18:17] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/870813 (owner: 10Muehlenhoff) [10:18:41] (03PS1) 10Effie Mouzeli: maps: enable OSM sync after imposm3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/875273 (https://phabricator.wikimedia.org/T325293) [10:18:48] (03CR) 10Muehlenhoff: [C: 03+2] Add a license statement to puppet.git with an overview [puppet] - 10https://gerrit.wikimedia.org/r/870813 (owner: 10Muehlenhoff) [10:20:35] (03CR) 10Effie Mouzeli: [C: 03+2] maps: enable OSM sync after imposm3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/875273 (https://phabricator.wikimedia.org/T325293) (owner: 10Effie Mouzeli) [10:20:49] (03CR) 10Volans: [C: 03+1] "LGTM, just some nits for removing leading empty line" [puppet] - 10https://gerrit.wikimedia.org/r/875270 (owner: 10Jbond) [10:21:00] (03PS2) 10Effie Mouzeli: maps: enable OSM sync on eqiad after imposm3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/875273 (https://phabricator.wikimedia.org/T325293) [10:21:06] (03CR) 10Effie Mouzeli: [V: 03+2] maps: enable OSM sync on eqiad after imposm3 upgrade [puppet] - 10https://gerrit.wikimedia.org/r/875273 (https://phabricator.wikimedia.org/T325293) (owner: 
10Effie Mouzeli) [10:22:10] (03CR) 10DCausse: [C: 03+1] wdqs: make depool the default behavior [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper) [10:22:20] (03CR) 10Jbond: [C: 03+1] Make sure cloud_cumin public key is evaluated [puppet] - 10https://gerrit.wikimedia.org/r/868732 (https://phabricator.wikimedia.org/T323483) (owner: 10FNegri) [10:22:22] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [10:22:32] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [10:23:28] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/874809 (owner: 10Volans) [10:24:11] (03CR) 10Volans: [C: 03+2] admin: ensure all contractors have an expiry date [puppet] - 10https://gerrit.wikimedia.org/r/874809 (owner: 10Volans) [10:24:31] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-cluster (exit_code=0) [10:25:16] (03CR) 10Jbond: [C: 03+2] monitoring: update monitoring files to dynamically discover config [puppet] - 10https://gerrit.wikimedia.org/r/869716 (https://phabricator.wikimedia.org/T321783) (owner: 10Jbond) [10:26:09] (03PS31) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [10:26:44] (03CR) 10CI reject: [V: 04-1] P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [10:29:15] !log Rolling reboot of api_appserver hosts in codfw [10:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:17] PROBLEM - Check systemd state on ms-fe2009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[10:29:35] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [10:29:36] PROBLEM - Check systemd state on es1031 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:39] PROBLEM - Check systemd state on db2129 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:53] PROBLEM - Check systemd state on db1199 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:55] PROBLEM - Check systemd state on db1196 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:29:55] RECOVERY - puppet last run on puppetdb1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [10:29:59] PROBLEM - Check systemd state on db2113 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:30:01] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01947 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [10:30:41] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [10:31:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db2124 T326206', diff saved to https://phabricator.wikimedia.org/P42756 and previous config saved to 
/var/cache/conftool/dbconfig/20230104-103109-marostegui.json [10:31:13] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [10:31:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/873791 (https://phabricator.wikimedia.org/T323096) (owner: 10Ryan Kemper) [10:32:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [10:33:01] (03PS1) 10Btullis: Update the role used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875275 (https://phabricator.wikimedia.org/T324670) [10:33:03] RECOVERY - Check systemd state on db1199 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:33:47] PROBLEM - Check systemd state on es1020 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:09] PROBLEM - Check systemd state on thumbor2004 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:21] RECOVERY - Check systemd state on es1031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:25] RECOVERY - Check systemd state on db2129 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:34:30] jbond: are those alerts related to your change? 
[10:34:36] (03CR) 10Btullis: [C: 03+2] Update the role used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875275 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [10:34:41] I have run puppet on some hosts and it seems to fix it, but I am wondering if we'll get a storm of them [10:35:01] 10SRE, 10SRE-Access-Requests: Requesting access to Turnilo/ logstash for USER:eileen - https://phabricator.wikimedia.org/T325608 (10Volans) The above patch is merged, I've also granted `eileen` the `wmf` LDAP group as it's required to access the resources listed in the task description. @Eileenmcnaughton plea... [10:35:15] PROBLEM - Check systemd state on thumbor1001 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:15] RECOVERY - Check systemd state on db1196 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:17] PROBLEM - Check systemd state on thumbor2005 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:37:09] (03PS1) 10Marostegui: mariadb: Productionize db2151 [puppet] - 10https://gerrit.wikimedia.org/r/875276 (https://phabricator.wikimedia.org/T326206) [10:37:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:37:23] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:37:49] PROBLEM - Check systemd state on db1199 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:38:06] * jbond looking [10:40:21] PROBLEM - Check systemd state on thumbor2003 is CRITICAL: CRITICAL - degraded: The following units failed: 
prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:40:31] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875270 (owner: 10Jbond) [10:40:34] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2151 [puppet] - 10https://gerrit.wikimedia.org/r/875276 (https://phabricator.wikimedia.org/T326206) (owner: 10Marostegui) [10:41:03] RECOVERY - Check systemd state on db1199 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:43:43] PROBLEM - Check systemd state on ms-fe1009 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:45:32] (03CR) 10Muehlenhoff: phabricator: change phd home dir to /var/lib/phd (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875266 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [10:45:49] PROBLEM - Check systemd state on db1199 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:04] ^^^ this relates to python3.5 issues sending a fix now [10:47:26] (03PS1) 10Jbond: prometheus-puppet-agent-stats: ensure compatibility with python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/875278 [10:47:45] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AKhatun out of all services on: 894 hosts [10:48:21] PROBLEM - Check systemd state on thumbor1002 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus_puppet_agent_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:48:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 894 hosts [10:48:30] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging AKhatun out of 
all services on: 1098 hosts [10:48:34] (03CR) 10Jbond: [C: 03+2] prometheus-puppet-agent-stats: ensure compatibility with python 3.5 [puppet] - 10https://gerrit.wikimedia.org/r/875278 (owner: 10Jbond) [10:49:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging AKhatun out of all services on: 1098 hosts [10:51:31] RECOVERY - Check systemd state on thumbor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:17] RECOVERY - Check systemd state on thumbor2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:52:49] RECOVERY - Check systemd state on thumbor1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:09] RECOVERY - Check systemd state on thumbor1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:15] RECOVERY - Check systemd state on ms-fe2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:19] RECOVERY - Check systemd state on ms-fe1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:21] RECOVERY - Check systemd state on thumbor2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:24] (03PS2) 10Jbond: hieradata: drop SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/875270 [10:55:34] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/875272 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [10:56:05] RECOVERY - Check systemd state on es1020 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:56:06] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875270 (owner: 10Jbond) [10:56:08] (03CR) 10Jbond: [C: 03+2] hieradata: drop SPDX headers [puppet] - 10https://gerrit.wikimedia.org/r/875270 (owner: 10Jbond) [10:56:27] !log (apt1001) import HAproxy 2.4.20 from third-party repo for buster and bullseye [10:56:28] (03CR) 10Jbond: [C: 03+2] hieradata: drop SPDX headers (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/875270 (owner: 10Jbond) [10:56:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:56:46] (03PS32) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [10:56:57] RECOVERY - Check systemd state on db1199 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:05] RECOVERY - Check systemd state on db2113 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:57:06] (03PS33) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [11:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1100) [11:02:11] !log testing HAProxy 2.4.20 in cp4037 and cp4045 [11:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:59] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.000998 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [11:08:14] (03CR) 10Hashar: "This would makes Puppet to configure the IP alias before starting Apache which might solve the issue when running Puppet on an empty host." 
[puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [11:08:52] (03PS1) 10Jbond: peering_news: fix typo and add spec test [puppet] - 10https://gerrit.wikimedia.org/r/875280 [11:09:34] (03CR) 10Jbond: "post merge comment: i noticed you already fixed the args handling but noticed there is also a typo see: https://gerrit.wikimedia.org/r/c/o" [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [11:09:49] (03PS34) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 [11:09:52] (03CR) 10CI reject: [V: 04-1] peering_news: fix typo and add spec test [puppet] - 10https://gerrit.wikimedia.org/r/875280 (owner: 10Jbond) [11:09:57] (03CR) 10Jbond: [C: 03+2] peering_news: fix typo and add spec test [puppet] - 10https://gerrit.wikimedia.org/r/875280 (owner: 10Jbond) [11:10:40] (03CR) 10Effie Mouzeli: P:mediawiki::mcrouter_wancache minor refactoring (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [11:10:47] (03PS2) 10Jbond: peering_news: fix typo and add spec test [puppet] - 10https://gerrit.wikimedia.org/r/875280 [11:11:36] (03CR) 10Jbond: [C: 03+2] peering_news: fix typo and add spec test [puppet] - 10https://gerrit.wikimedia.org/r/875280 (owner: 10Jbond) [11:12:09] (03CR) 10CI reject: [V: 04-1] P:mediawiki::mcrouter_wancache minor refactoring [puppet] - 10https://gerrit.wikimedia.org/r/860102 (owner: 10Effie Mouzeli) [11:15:10] (03PS1) 10Majavah: openstack: designate: use the enc api to update git data [puppet] - 10https://gerrit.wikimedia.org/r/875281 (https://phabricator.wikimedia.org/T318504) [11:16:17] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38961/console" [puppet] - 10https://gerrit.wikimedia.org/r/875281 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [11:18:28] (03PS3) 10KartikMistry: WIP: Enable 
Content Translation/Section Translation on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870080 (https://phabricator.wikimedia.org/T325714) [11:18:38] (03PS1) 10Majavah: openstack: horizon: use the enc api to update git data [puppet] - 10https://gerrit.wikimedia.org/r/875283 (https://phabricator.wikimedia.org/T318504) [11:19:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38962/console" [puppet] - 10https://gerrit.wikimedia.org/r/875283 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [11:23:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 1%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42758 and previous config saved to /var/cache/conftool/dbconfig/20230104-112300-root.json [11:25:24] (03PS1) 10Marostegui: instances.yaml: Add db2151 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/875285 (https://phabricator.wikimedia.org/T326206) [11:26:19] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2151 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/875285 (https://phabricator.wikimedia.org/T326206) (owner: 10Marostegui) [11:28:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Add db2151 to dbctl depooled T326206', diff saved to https://phabricator.wikimedia.org/P42759 and previous config saved to /var/cache/conftool/dbconfig/20230104-112801-marostegui.json [11:28:07] T326206: Move db1176 and db2151 to s6 - https://phabricator.wikimedia.org/T326206 [11:33:15] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host puppetdb2003.codfw.wmnet [11:38:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 5%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42761 and previous config saved to /var/cache/conftool/dbconfig/20230104-113805-root.json [11:43:53] (03CR) 10Muehlenhoff: [C: 03+2] Properly name replication records for 
second puppetdb pair [puppet] - 10https://gerrit.wikimedia.org/r/875272 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [11:46:15] (03PS1) 10Marostegui: change_cul_user_T326011.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875288 (https://phabricator.wikimedia.org/T326011) [11:48:35] (03CR) 10Ladsgroup: [C: 03+1] change_cul_user_T326011.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875288 (https://phabricator.wikimedia.org/T326011) (owner: 10Marostegui) [11:49:51] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:49:58] (03CR) 10Marostegui: [C: 03+2] change_cul_user_T326011.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875288 (https://phabricator.wikimedia.org/T326011) (owner: 10Marostegui) [11:50:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2109.codfw.wmnet with reason: Maintenance [11:50:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T326011)', diff saved to https://phabricator.wikimedia.org/P42763 and previous config saved to /var/cache/conftool/dbconfig/20230104-115011-marostegui.json [11:50:14] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [11:50:20] (03Merged) 10jenkins-bot: change_cul_user_T326011.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875288 (https://phabricator.wikimedia.org/T326011) (owner: 10Marostegui) [11:51:57] (03PS1) 10Btullis: Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) [11:53:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 10%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42764 and previous config saved 
to /var/cache/conftool/dbconfig/20230104-115310-root.json [11:53:23] (03PS1) 10Marostegui: Create 2023 directory [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875291 [11:53:50] (03CR) 10Marostegui: [C: 03+2] Create 2023 directory [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875291 (owner: 10Marostegui) [11:54:13] (03Merged) 10jenkins-bot: Create 2023 directory [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875291 (owner: 10Marostegui) [11:54:39] (03PS1) 10Urbanecm: Add namespace translations in Wayuu [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874879 (https://phabricator.wikimedia.org/T321881) [11:54:59] (03PS1) 10Urbanecm: Add namespace translations in Wayuu [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874880 (https://phabricator.wikimedia.org/T321881) [11:55:40] (03CR) 10Urbanecm: [C: 03+2] "backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874880 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [11:56:02] (03CR) 10Urbanecm: [C: 03+2] "backport" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874879 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [11:58:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T326011)', diff saved to https://phabricator.wikimedia.org/P42765 and previous config saved to /var/cache/conftool/dbconfig/20230104-115844-marostegui.json [11:58:48] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [12:00:04] Urbanecm and Amir1: It is that lovely time of the day again! You are hereby commanded to deploy New wikis creation. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1200). 
[12:00:37] let’s do it [12:00:41] new wikis :O [12:00:56] (03PS2) 10Btullis: Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) [12:01:17] taavi: yup yup, it's that time of the year again :) [12:01:30] o/ [12:03:56] * urbanecm is creating patches [12:08:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 25%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42766 and previous config saved to /var/cache/conftool/dbconfig/20230104-120815-root.json [12:08:43] Thank you Martin <3 [12:09:10] (03PS3) 10Btullis: Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) [12:10:01] (03PS1) 10Urbanecm: Initial configuration for aswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875292 (https://phabricator.wikimedia.org/T321246) [12:11:12] * urbanecm feels it's very confusing to have guwikiquote and guwwikiquote at the same time [12:11:54] oh gosh [12:12:10] (03Merged) 10jenkins-bot: Add namespace translations in Wayuu [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874880 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [12:12:17] * urbanecm is not the one who assigns lang codes [12:12:37] (03Merged) 10jenkins-bot: Add namespace translations in Wayuu [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874879 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [12:12:41] (03PS1) 10Urbanecm: Initial configuration for guwwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875293 (https://phabricator.wikimedia.org/T321247) [12:13:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P42767 and previous config saved to 
/var/cache/conftool/dbconfig/20230104-121350-marostegui.json [12:15:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874880 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [12:15:55] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874879 (https://phabricator.wikimedia.org/T321881) (owner: 10Urbanecm) [12:16:16] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:874880|Add namespace translations in Wayuu (T321881)]], [[gerrit:874879|Add namespace translations in Wayuu (T321881)]] [12:16:22] T321881: Add namespace translations in Wayuu - https://phabricator.wikimedia.org/T321881 [12:17:41] (03PS4) 10Btullis: Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) [12:18:22] !log urbanecm@deploy1002 urbanecm and urbanecm: Backport for [[gerrit:874880|Add namespace translations in Wayuu (T321881)]], [[gerrit:874879|Add namespace translations in Wayuu (T321881)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [12:20:54] (03PS1) 10Urbanecm: Initial configuration for shnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875298 (https://phabricator.wikimedia.org/T321248) [12:22:12] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for aswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875292 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [12:23:20] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42768 and previous config saved to /var/cache/conftool/dbconfig/20230104-122320-root.json [12:23:25] (03Merged) 10jenkins-bot: 
Initial configuration for aswikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875292 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [12:24:53] addwiki completed with no errors this time [12:25:39] (03PS13) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [12:25:41] (03PS1) 10Jbond: phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 [12:25:43] (03PS1) 10Jbond: dns_lookup: import dnslookup to evaluate if usefull vendor upstream [puppet] - 10https://gerrit.wikimedia.org/r/875300 [12:25:58] aswikiquote is live at mwdebug [12:26:53] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:874880|Add namespace translations in Wayuu (T321881)]], [[gerrit:874879|Add namespace translations in Wayuu (T321881)]] (duration: 10m 36s) [12:26:54] Err, I'm in the middle of rebooting appservers [12:26:56] T321881: Add namespace translations in Wayuu - https://phabricator.wikimedia.org/T321881 [12:27:01] Want me to stop while you deploy stuff? [12:27:16] urbanecm: Amir1 ^^ [12:27:19] claime: was just going to complain scap complained. [12:27:43] not sure how quick it is on your side. I'll be deploying for the next hour more or less [12:27:47] claime: how long it's going to take? 
[12:27:49] It is not quick [12:27:55] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [12:28:02] let's stop it yeah [12:28:06] +1 to stopping [12:28:08] Gimme a sec [12:28:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P42769 and previous config saved to /var/cache/conftool/dbconfig/20230104-122857-marostegui.json [12:29:13] (03CR) 10CI reject: [V: 04-1] O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [12:29:19] (03CR) 10CI reject: [V: 04-1] phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 (owner: 10Jbond) [12:29:35] (03PS1) 10Muehlenhoff: puppetdb/bookworm: One more typo in the config [puppet] - 10https://gerrit.wikimedia.org/r/875301 (https://phabricator.wikimedia.org/T321783) [12:30:33] Amir1: urbanecm, reboots stopped, appservers repooled [12:30:37] thanks! 
[12:30:40] continuing [12:30:40] (03PS5) 10Btullis: Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) [12:30:51] If it complains on parse1002 though, it's supposed to be pooled=inactive (it's got a broken CPU) [12:30:59] !log urbanecm@deploy1002 Started scap: Creating aswikiquote (T321246) [12:31:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:31:02] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [12:32:03] (03CR) 10Btullis: [C: 03+2] Update the partman recipe that is used for the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/875290 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [12:32:10] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/875301 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [12:32:36] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for guwwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875293 (https://phabricator.wikimedia.org/T321247) (owner: 10Urbanecm) [12:32:49] (03PS2) 10Jbond: phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 [12:32:51] (03PS2) 10Jbond: dns_lookup: import dnslookup to evaluate if usefull vendor upstream [puppet] - 10https://gerrit.wikimedia.org/r/875300 [12:32:53] (03PS14) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [12:32:57] Should I have locked scap? Just so I know for next time? 
[12:33:32] (03Merged) 10jenkins-bot: Initial configuration for guwwikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875293 (https://phabricator.wikimedia.org/T321247) (owner: 10Urbanecm) [12:35:07] (03CR) 10CI reject: [V: 04-1] phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 (owner: 10Jbond) [12:35:24] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1001.eqiad.wmnet with OS bullseye [12:35:56] claime: that'd help, i think. I've also recorded the wiki creation at https://wikitech.wikimedia.org/wiki/Deployments (albeit i did that only yesterday), but i'm not sure if it's clear that window involves the appservers. [12:36:30] urbanecm: No I just ran over without noticing, that's not your fault [12:37:04] ah, i see. [12:37:33] scap doesn't complain about any host so far. [12:38:20] Cool [12:38:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: After cloning db2151', diff saved to https://phabricator.wikimedia.org/P42770 and previous config saved to /var/cache/conftool/dbconfig/20230104-123825-root.json [12:38:48] !log urbanecm@deploy1002 Finished scap: Creating aswikiquote (T321246) (duration: 07m 49s) [12:38:52] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [12:39:22] (03PS15) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [12:40:11] !log Rolling reboot of api_appserver hosts in codfw paused for https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1200 [12:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:55] guwwikiquote works at debug server, syncing [12:41:08] !log urbanecm@deploy1002 Started scap: Creating guwwikiquote (T321247) [12:41:11] T321247: Create Wikiquote Gungbe - https://phabricator.wikimedia.org/T321247 [12:41:33] !log cgoubert@deploy1002 helmfile [eqiad] START 
helmfile.d/services/mw-debug: apply [12:41:42] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [12:41:44] (03PS2) 10Jelto: gitlab: reduce backup_keep_time to 1d [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) [12:42:16] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for shnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875298 (https://phabricator.wikimedia.org/T321248) (owner: 10Urbanecm) [12:42:54] (03PS1) 10Urbanecm: Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874882 (https://phabricator.wikimedia.org/T326137) [12:43:07] (03PS1) 10Urbanecm: Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874883 (https://phabricator.wikimedia.org/T326137) [12:43:13] (03CR) 10Urbanecm: [C: 03+2] Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874883 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [12:43:15] (03Merged) 10jenkins-bot: Initial configuration for shnwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875298 (https://phabricator.wikimedia.org/T321248) (owner: 10Urbanecm) [12:43:18] (03CR) 10Urbanecm: [C: 03+2] Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874882 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [12:43:55] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38964/console" [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [12:44:04] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 
(T326011)', diff saved to https://phabricator.wikimedia.org/P42771 and previous config saved to /var/cache/conftool/dbconfig/20230104-124403-marostegui.json [12:44:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:44:07] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [12:44:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2127.codfw.wmnet with reason: Maintenance [12:44:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T326011)', diff saved to https://phabricator.wikimedia.org/P42772 and previous config saved to /var/cache/conftool/dbconfig/20230104-124424-marostegui.json [12:48:53] !log urbanecm@deploy1002 Finished scap: Creating guwwikiquote (T321247) (duration: 07m 44s) [12:48:56] T321247: Create Wikiquote Gungbe - https://phabricator.wikimedia.org/T321247 [12:50:43] shnwikibooks works at debug server, syncing [12:50:55] !log urbanecm@deploy1002 Started scap: Creating shnwikibooks (T321248) [12:50:58] T321248: Create Wikibooks Shan - https://phabricator.wikimedia.org/T321248 [12:51:18] (03PS1) 10Urbanecm: Initial configuration for gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875302 (https://phabricator.wikimedia.org/T326137) [12:51:55] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [12:53:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T326011)', diff saved to https://phabricator.wikimedia.org/P42773 and previous config saved to /var/cache/conftool/dbconfig/20230104-125310-marostegui.json [12:53:14] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [12:53:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: After cloning db2151', diff 
saved to https://phabricator.wikimedia.org/P42774 and previous config saved to /var/cache/conftool/dbconfig/20230104-125330-root.json [12:53:39] (03CR) 10Ayounsi: [C: 03+2] "indeed! thanks for finding/fixing the typo!" [puppet] - 10https://gerrit.wikimedia.org/r/849114 (owner: 10Ayounsi) [12:54:17] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cephosd1001.eqiad.wmnet with reason: host reimage [12:55:38] (03PS1) 10Urbanecm: aswikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875303 (https://phabricator.wikimedia.org/T321246) [12:55:59] !log installing emacs security updates [12:56:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:57:14] (03CR) 10Ayounsi: [C: 03+2] cr-cloud: permit LDAPS traffic too [homer/public] - 10https://gerrit.wikimedia.org/r/869736 (owner: 10Majavah) [12:57:50] (03Merged) 10jenkins-bot: cr-cloud: permit LDAPS traffic too [homer/public] - 10https://gerrit.wikimedia.org/r/869736 (owner: 10Majavah) [12:58:34] !log urbanecm@deploy1002 Finished scap: Creating shnwikibooks (T321248) (duration: 07m 38s) [12:58:37] T321248: Create Wikibooks Shan - https://phabricator.wikimedia.org/T321248 [12:59:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874883 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:00:00] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [extensions/WikimediaMessages] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874882 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:00:24] (03Merged) 10jenkins-bot: Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874883 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:00:27] 
(03Merged) 10jenkins-bot: Add messages for Gorontalo Wiktionary (gorwiktionary) [extensions/WikimediaMessages] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/874882 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:00:53] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:874883|Add messages for Gorontalo Wiktionary (gorwiktionary) (T326137)]], [[gerrit:874882|Add messages for Gorontalo Wiktionary (gorwiktionary) (T326137)]] [13:00:59] T326137: Create Wiktionary Gorontalo - https://phabricator.wikimedia.org/T326137 [13:01:15] (03CR) 10Urbanecm: [C: 03+2] Initial configuration for gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875302 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:01:44] 10SRE, 10Stashbot: SAL on wikitech missing data - https://phabricator.wikimedia.org/T244766 (10LSobanski) 05Open→03Resolved a:03LSobanski I can see the SAL data between Jan 1st and Feb 5th 2020 on https://wikitech.wikimedia.org/wiki/Server_Admin_Log/Archive_40. Resolving. [13:01:56] (03Merged) 10jenkins-bot: Initial configuration for gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875302 (https://phabricator.wikimedia.org/T326137) (owner: 10Urbanecm) [13:02:36] marostegui: I suggest waiting for a bit, let's get all new wikis deployed [13:02:58] yup, i'll create one more database [13:03:10] Amir1: Yeah, not doing anything for now. 
Just assigning to myself so it is clear there's still sanitization pending and hence views shouldn't be created :) [13:03:26] 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10cloud-services-team (Kanban): Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914 (10LSobanski) [13:03:27] cool [13:08:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42775 and previous config saved to /var/cache/conftool/dbconfig/20230104-130816-marostegui.json [13:09:09] scap seems to be stalled at building some k8s stuff for about six minutes now [13:11:21] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [13:11:36] (03PS1) 10Urbanecm: Initial configuration for vewikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875304 (https://phabricator.wikimedia.org/T320890) [13:12:02] is there a way to verify scap is doing anything useful at the build-and-push-container-images stage? 
[13:12:13] it creates a log file in your home directory [13:12:22] yes, ~/scap-image-build-and-push-log [13:12:30] thanks [13:12:46] that file has some output, so waiting [13:14:56] it looks done [13:15:04] yup, now scap moved forward [13:15:16] !log fix missmatch MTU on cloudsw switches - T315838 [13:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:20] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [13:18:05] (03CR) 10Jbond: O:installserver::light: Update ACL to be based on roles (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [13:19:58] 10SRE, 10MediaWiki-Shell, 10serviceops: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (10LSobanski) [13:20:24] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38965/console" [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [13:20:40] 10SRE, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff: Evaluate/integrate eatmydata in d-i - https://phabricator.wikimedia.org/T278312 (10LSobanski) [13:21:06] 10SRE, 10Beta-Cluster-Infrastructure: cannot curl to wiki from beta mw appservers - https://phabricator.wikimedia.org/T278599 (10LSobanski) 05Open→03Resolved a:03LSobanski Resolving based on the previous comment. Please reopen if this is not a satisfactory solution. [13:22:22] (03PS4) 10Slyngshede: WIP: Access Requests [software/bitu] - 10https://gerrit.wikimedia.org/r/870747 [13:22:33] scap is now stalled at the sync-testservers stage :-/ [13:22:49] (03CR) 10Jelto: [V: 03+1 C: 03+2] gitlab: reduce backup_keep_time to 1d [puppet] - 10https://gerrit.wikimedia.org/r/829747 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [13:23:12] (03CR) 10Jbond: [V: 03+1] "Ready to review. 
PCC shows restricted.bastion.wmflabs.org as 172.16.5., however that's because the pcc hosts see the split dns answer. W" [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [13:23:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P42776 and previous config saved to /var/cache/conftool/dbconfig/20230104-132323-marostegui.json [13:24:56] finally moving forward :/ [13:26:58] (03PS1) 10Zabe: Start reading from cul_actor on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875305 (https://phabricator.wikimedia.org/T233004) [13:27:21] 10SRE, 10Data-Services, 10Infrastructure-Foundations, 10User-MoritzMuehlenhoff, 10cloud-services-team (Kanban): Switch labstore servers to default SSH configuration - https://phabricator.wikimedia.org/T177914 (10MoritzMuehlenhoff) [13:29:04] 10SRE, 10Infrastructure-Foundations: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (10LSobanski) @MoritzMuehlenhoff the only reference I could find to a DSA key is in https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/ssh/t... 
[13:31:54] !log New wiki creation will run over by a couple of minutes [13:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:15] (03CR) 10Raymond Ndibe: tools-webservice: read buildservice_repository from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [13:32:53] (03PS3) 10Jbond: phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 [13:32:55] (03PS3) 10Jbond: dns_lookup: import dnslookup to evaluate if usefull vendor upstream [puppet] - 10https://gerrit.wikimedia.org/r/875300 [13:32:57] (03PS16) 10Jbond: O:installserver::light: Update ACL to be based on roles [puppet] - 10https://gerrit.wikimedia.org/r/869224 [13:32:59] (03PS31) 10Jbond: P:installserver::proxy: Add global whitelist and list mappings [puppet] - 10https://gerrit.wikimedia.org/r/753029 (https://phabricator.wikimedia.org/T300977) [13:33:44] !log fix missmatch MTU on pfw3-codfw - T315838 [13:33:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:07] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [13:36:37] (03CR) 10Raymond Ndibe: tools-webservice: read buildservice_repository from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [13:36:55] 10SRE, 10TimedMediaHandler-Transcode: Increase job runners on video scalers to maximize load efficiency - https://phabricator.wikimedia.org/T201358 (10LSobanski) Neither the dashboard nor the config file exist anymore. Resolving, please reopen if you still consider this to be a valid request. 
[13:37:49] (03PS1) 10Muehlenhoff: Stop using DSA host keys also for cloud vps instances [puppet] - 10https://gerrit.wikimedia.org/r/875306 (https://phabricator.wikimedia.org/T177371) [13:38:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T326011)', diff saved to https://phabricator.wikimedia.org/P42777 and previous config saved to /var/cache/conftool/dbconfig/20230104-133830-marostegui.json [13:38:32] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:38:33] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [13:38:38] 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Phase out DSA keys for SSH access (ssh-dss) - https://phabricator.wikimedia.org/T177371 (10MoritzMuehlenhoff) All users with production access have RSA or ed25519 keys, but I need to double-check if there are Cloud VPS users with a DSA key. I've create... [13:38:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:39:17] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:874883|Add messages for Gorontalo Wiktionary (gorwiktionary) (T326137)]], [[gerrit:874882|Add messages for Gorontalo Wiktionary (gorwiktionary) (T326137)]] (duration: 38m 23s) [13:39:20] T326137: Create Wiktionary Gorontalo - https://phabricator.wikimedia.org/T326137 [13:39:22] finally! [13:39:33] why the hell did it take 38 minutes... [13:39:33] marostegui: most of the schema changes can be done with replication [13:41:15] (03CR) 10FNegri: [C: 03+2] "Sorry for the long delay, merging this now." 
[puppet] - 10https://gerrit.wikimedia.org/r/854479 (https://phabricator.wikimedia.org/T307389) (owner: 10Majavah) [13:41:22] !log drain esams-eqiad link for mtu change - T315838 [13:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:31] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [13:42:35] (03CR) 10FNegri: [C: 03+2] Drop old wikilabels database roles [puppet] - 10https://gerrit.wikimedia.org/r/854478 (https://phabricator.wikimedia.org/T307389) (owner: 10Majavah) [13:42:42] rebuilding localisation cache for wmf.14 and wmf.17, I guess? [13:42:50] (03PS2) 10Urbanecm: aswikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875303 (https://phabricator.wikimedia.org/T321246) [13:42:53] (03CR) 10Urbanecm: [C: 03+2] aswikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875303 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [13:43:42] (03Merged) 10jenkins-bot: aswikiquote: Add logo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875303 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [13:43:50] zabe: maybe? my last sync (two minutes before starting the super-long one) took 7 minutes, which is what i'm confused by. [13:44:22] did you backport new messages or something like that? [13:44:26] !log urbanecm@deploy1002 Started scap: Creating gorwiktionary (T326137), fixing aswikiquote logo (T321246) [13:44:30] T326137: Create Wiktionary Gorontalo - https://phabricator.wikimedia.org/T326137 [13:44:31] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [13:44:47] yes he did [13:44:59] !log repool esams-eqiad link for mtu change - T315838 [13:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:24] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:45:28] good point. 
although i recall doing that in some prior deployment when it didn't result in 38 minutes long scap sync. [13:45:38] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:45:43] * urbanecm will expect this next time he backports messages [13:45:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T326011)', diff saved to https://phabricator.wikimedia.org/P42778 and previous config saved to /var/cache/conftool/dbconfig/20230104-134544-marostegui.json [13:45:47] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [13:47:40] (03PS1) 10Marostegui: drop_sppr_entity_T326221.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875307 (https://phabricator.wikimedia.org/T326221) [13:48:13] (03CR) 10Marostegui: "Amir please review the check, I am not sure it will work really" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875307 (https://phabricator.wikimedia.org/T326221) (owner: 10Marostegui) [13:49:09] <_joe_> jouncebot: nowandnext [13:49:10] No deployments scheduled for the next 0 hour(s) and 10 minute(s) [13:49:10] In 0 hour(s) and 10 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1400) [13:49:27] _joe_: i'm still deploying for the previous window [13:49:42] (03CR) 10Ladsgroup: "This table is empty in all wikis except votewiki and it's only 1500 rows in votewiki. 
I think this schema change can be ran with replicati" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875307 (https://phabricator.wikimedia.org/T326221) (owner: 10Marostegui) [13:50:49] (03CR) 10Marostegui: "Oh cool \o/" [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875307 (https://phabricator.wikimedia.org/T326221) (owner: 10Marostegui) [13:52:18] !log urbanecm@deploy1002 Finished scap: Creating gorwiktionary (T326137), fixing aswikiquote logo (T321246) (duration: 07m 52s) [13:52:26] T326137: Create Wiktionary Gorontalo - https://phabricator.wikimedia.org/T326137 [13:52:26] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [13:52:49] (03PS1) 10Urbanecm: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874844 [13:52:53] 10SRE, 10MediaWiki-Shell, 10serviceops: Update limit.sh to support systemd-based cgroup management - https://phabricator.wikimedia.org/T136603 (10Joe) 05Open→03Invalid Since then we've moved to using remote shellbox in production, so I'm not strictly interested anymore in any solution compatible with cgr... 
[13:52:59] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874844 (owner: 10Urbanecm) [13:53:03] !log dbmaint deploy schema change on s1 T326221 [13:53:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:05] T326221: Add primary key and drop unique index on securepoll_properties on wmf wikis - https://phabricator.wikimedia.org/T326221 [13:53:42] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874844 (owner: 10Urbanecm) [13:54:03] !log dbmaint deploy schema change on s2 T326221 [13:54:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:08] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:874844|Update interwiki cache]] [13:54:30] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T326011)', diff saved to https://phabricator.wikimedia.org/P42779 and previous config saved to /var/cache/conftool/dbconfig/20230104-135429-marostegui.json [13:54:34] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [13:55:14] !log dbmaint deploy schema change on s4 T326221 [13:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:45] (03PS1) 10Bartosz Dziewoński: Mark active sections even when their headings are in wrapper elements [skins/Vector] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874884 (https://phabricator.wikimedia.org/T318044) [13:56:18] !log dbmaint deploy schema change on s5 T326221 [13:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:03] !log dbmaint deploy schema change on s6 T326221 [13:57:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:46] !log dbmaint deploy schema change on s8 T326221 [13:57:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:42] !log dbmaint deploy schema change on s7 T326221 [13:58:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:46] T326221: Add primary key and drop unique index on securepoll_properties on wmf wikis - https://phabricator.wikimedia.org/T326221 [13:59:40] (03Abandoned) 10Marostegui: drop_sppr_entity_T326221.py: New schema change [software/schema-changes] - 10https://gerrit.wikimedia.org/r/875307 (https://phabricator.wikimedia.org/T326221) (owner: 10Marostegui) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1400). [14:00:05] cirno, zabe, and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] i can deploy today! [14:00:16] (phew) [14:00:31] (03CR) 10Urbanecm: [C: 04-2] "Meets "Grant Interface admins other permissions" listed at https://meta.wikimedia.org/wiki/Limits_to_configuration_changes. I'll comment o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:00:40] hi [14:01:06] cirno: hi, around? 
[14:01:23] (03CR) 10Urbanecm: [C: 03+2] Mark active sections even when their headings are in wrapper elements [skins/Vector] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874884 (https://phabricator.wikimedia.org/T318044) (owner: 10Bartosz Dziewoński) [14:01:31] hey [14:01:36] hi zabe [14:02:09] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:874844|Update interwiki cache]] (duration: 08m 00s) [14:02:24] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM but you will need to change the chart version to a higher subnumber 😊" [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (https://phabricator.wikimedia.org/T288375) (owner: 10Clément Goubert) [14:02:27] !log updating buster nodes running 5.10 to 5.10.158-2~deb10u1 (only rollout of the new kernel, no reboots) [14:02:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:35] o/ [14:02:51] !log dbmaint deploy schema change on s3 T326221 [14:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:58] !log fix inconsistent mtu on mr1-ulsfo - T315838 [14:03:35] (03PS4) 10Urbanecm: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823) (owner: 10Stang) [14:03:47] (03PS2) 10Urbanecm: kuwiki: Install SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870988 (https://phabricator.wikimedia.org/T325469) (owner: 10Stang) [14:04:15] T326221: Add primary key and drop unique index on securepoll_properties on wmf wikis - https://phabricator.wikimedia.org/T326221 [14:04:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:54] (03PS6) 10Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 [14:05:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823) (owner: 10Stang) [14:05:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870988 (https://phabricator.wikimedia.org/T325469) (owner: 10Stang) [14:06:05] (03PS1) 10Jelto: gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) [14:06:07] (03Merged) 10jenkins-bot: Revert "trwiki: Add 20 years celebration logos" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870920 (https://phabricator.wikimedia.org/T325823) (owner: 10Stang) [14:06:10] (03Merged) 10jenkins-bot: kuwiki: Install SandboxLink [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870988 (https://phabricator.wikimedia.org/T325469) (owner: 10Stang) [14:06:21] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:06:37] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:870920|Revert "trwiki: Add 20 years celebration logos" (T325823)]], [[gerrit:870988|kuwiki: Install SandboxLink (T325469)]] [14:06:41] T325823: Requesting temporary logo change for tr.wikipedia.org - https://phabricator.wikimedia.org/T325823 [14:06:42] T325469: Enable Extension:SandboxLink on ku.wikipedia - https://phabricator.wikimedia.org/T325469 [14:08:22] (03CR) 10Bartosz Dziewoński: plwiki: Add editcontentmodel to interface-admin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:08:25] !log urbanecm@deploy1002 urbanecm and stang: Backport for [[gerrit:870920|Revert "trwiki: Add 20 years celebration logos" (T325823)]], [[gerrit:870988|kuwiki: Install SandboxLink (T325469)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet [14:08:43] hi cirno, can you test at mwdebug1001 please? 
[14:08:48] looking [14:09:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P42780 and previous config saved to /var/cache/conftool/dbconfig/20230104-140936-marostegui.json [14:09:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on puppetdb2002.codfw.wmnet with reason: maintenance [14:09:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on puppetdb2002.codfw.wmnet with reason: maintenance [14:10:02] urbanecm, both two patches LGTM [14:10:07] !log dbmaint deploy schema change on s7 T326225 [14:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:10] T326225: Add primary key and drop unique index on cn_notice_languages on wmf wikis - https://phabricator.wikimedia.org/T326225 [14:10:11] (03CR) 10Urbanecm: [C: 03+1] "thanks Bartosz." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:10:16] thanks cirno, syncing [14:10:57] !log dbmaint deploy schema change on s7 eqiad T326225 [14:10:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:15] !log dbmaint deploy schema change on s1 eqiad T326221 [14:11:16] !log dbmaint deploy schema change on s2 eqiad T326221 [14:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:18] !log dbmaint deploy schema change on s3 eqiad T326221 [14:11:18] T326221: Add primary key and drop unique index on securepoll_properties on wmf wikis - https://phabricator.wikimedia.org/T326221 [14:11:19] !log dbmaint deploy schema change on s4 eqiad T326221 [14:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:20] !log dbmaint deploy schema change on s5 eqiad T326221 [14:11:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:22] !log dbmaint deploy schema 
change on s6 eqiad T326221 [14:11:24] !log dbmaint deploy schema change on s7 eqiad T326221 [14:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:26] !log dbmaint deploy schema change on s8 eqiad T326221 [14:11:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:41] (03CR) 10Jbond: [C: 03+1] puppetdb/bookworm: One more typo in the config [puppet] - 10https://gerrit.wikimedia.org/r/875301 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [14:12:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/875306 (https://phabricator.wikimedia.org/T177371) (owner: 10Muehlenhoff) [14:12:26] urbanecm: was the last db created? [14:12:50] marostegui: yes, wiki creation is done for today. i'll update tasks now. 
[14:13:01] urbanecm: excellent thanks [14:13:23] !log dbmaint deploy schema change on s7 eqiad T326226 [14:13:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:26] T326226: Add primary key and drop unique index on cn_notice_projects on wmf wikis - https://phabricator.wikimedia.org/T326226 [14:13:55] (03PS3) 10Matthias Mullie: [SearchVue] Enable extension on ptwiki, ruwiki & idwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830877 [14:14:58] !log dbmaint deploy schema change on s7 eqiad T326228 [14:15:00] (03CR) 10David Caro: tools-webservice: read buildservice_repository from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [14:15:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:01] T326228: Add primary key and drop unique index on cn_notice_regions on wmf wikis - https://phabricator.wikimedia.org/T326228 [14:15:10] !log fix inconsistent mtu on mr1-esams - T315838 [14:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:13] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:16:10] !log Sanitize new wikis T326138 T321294 T321288 T321256 [14:16:15] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:870920|Revert "trwiki: Add 20 years celebration logos" (T325823)]], [[gerrit:870988|kuwiki: Install SandboxLink (T325469)]] (duration: 09m 37s) [14:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:18] T321294: Prepare and check storage layer for aswikiquote - https://phabricator.wikimedia.org/T321294 [14:16:18] T326138: Prepare and check storage layer for gorwiktionary - https://phabricator.wikimedia.org/T326138 [14:16:18] T321256: Prepare and check storage layer for shnwikibooks - https://phabricator.wikimedia.org/T321256 [14:16:19] T321288: Prepare and check storage 
layer for guwwikiquote - https://phabricator.wikimedia.org/T321288 [14:16:22] T325823: Requesting temporary logo change for tr.wikipedia.org - https://phabricator.wikimedia.org/T325823 [14:16:23] T325469: Enable Extension:SandboxLink on ku.wikipedia - https://phabricator.wikimedia.org/T325469 [14:16:28] cirno: your first two patches are live now. [14:16:45] (03PS2) 10Urbanecm: plwiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:16:48] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:16:54] (03Merged) 10jenkins-bot: Mark active sections even when their headings are in wrapper elements [skins/Vector] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874884 (https://phabricator.wikimedia.org/T318044) (owner: 10Bartosz Dziewoński) [14:16:54] !log urbanecm@deploy1002 backport aborted: (duration: 00m 07s) [14:17:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:17:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [skins/Vector] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/874884 (https://phabricator.wikimedia.org/T318044) (owner: 10Bartosz Dziewoński) [14:17:23] PROBLEM - uWSGI puppetboard -http via nrpe- on puppetboard2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 BAD GATEWAY - 275 bytes in 0.064 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [14:17:26] (03PS4) 10Majavah: openstack: modernize puppetleaks script [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) [14:17:39] 
(03Merged) 10jenkins-bot: plwiki: Add editcontentmodel to interface-admin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/870978 (https://phabricator.wikimedia.org/T325819) (owner: 10Stang) [14:18:04] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:870978|plwiki: Add editcontentmodel to interface-admin (T325819)]], [[gerrit:874884|Mark active sections even when their headings are in wrapper elements (T318044 T324869)]] [14:18:10] T318044: Active section is determined incorrectly when real active section is inside a wrapper - https://phabricator.wikimedia.org/T318044 [14:18:10] T325819: pl.wiki: editcontentmodel for interface-admin - https://phabricator.wikimedia.org/T325819 [14:18:11] T324869: Wrong headings are bolded in the ToC - https://phabricator.wikimedia.org/T324869 [14:18:44] (03CR) 10Raymond Ndibe: tools-webservice: read buildservice_repository from config (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) (owner: 10Raymond Ndibe) [14:19:01] RECOVERY - uWSGI puppetboard -http via nrpe- on puppetboard2002 is OK: HTTP OK: HTTP/1.1 200 OK - 81963 bytes in 5.535 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/puppetboard [14:19:32] (03CR) 10David Caro: [C: 03+2] openstack: modernize puppetleaks script [puppet] - 10https://gerrit.wikimedia.org/r/849494 (https://phabricator.wikimedia.org/T274666) (owner: 10Majavah) [14:19:54] !log urbanecm@deploy1002 urbanecm and stang and matmarex: Backport for [[gerrit:870978|plwiki: Add editcontentmodel to interface-admin (T325819)]], [[gerrit:874884|Mark active sections even when their headings are in wrapper elements (T318044 T324869)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [14:20:29] MatmaRex: cirno: can you test your patches at mwdebug1001 please? 
[14:20:37] looking [14:20:38] yeah [14:21:12] my change looks good (testing the TOC on https://fr.wikiquote.org/wiki/Discussion_Wikiquote:Accueil) [14:21:25] LGTM [14:21:30] great, syncing [14:22:12] !log fix inconsistent mtu on mr1-eqsin - T315838 [14:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:16] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:22:41] (03PS1) 10Urbanecm: aswikiquote: Set timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875312 (https://phabricator.wikimedia.org/T321246) [14:23:15] (03PS2) 10Urbanecm: Start reading from cul_actor on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:23:22] (03CR) 10Urbanecm: [C: 03+2] Start reading from cul_actor on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:24:06] (03Merged) 10jenkins-bot: Start reading from cul_actor on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:24:21] !log dbmaint deploy schema change on s7 eqiad T326227 [14:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:24:25] T326227: Add primary key and drop unique index on cn_notice_countries on wmf wikis - https://phabricator.wikimedia.org/T326227 [14:24:33] (03PS2) 10Urbanecm: aswikiquote: Set timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875312 (https://phabricator.wikimedia.org/T321246) [14:24:38] (03CR) 10Urbanecm: [C: 03+2] aswikiquote: Set timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875312 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [14:24:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to 
https://phabricator.wikimedia.org/P42781 and previous config saved to /var/cache/conftool/dbconfig/20230104-142442-marostegui.json [14:25:22] (03Merged) 10jenkins-bot: aswikiquote: Set timezone to Asia/Kolkata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875312 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [14:25:47] (03PS1) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) [14:25:51] (03PS1) 10Hashar: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) [14:27:15] !log fix inconsistent mtu on mr1-codfw - T315838 [14:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:19] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:27:36] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:870978|plwiki: Add editcontentmodel to interface-admin (T325819)]], [[gerrit:874884|Mark active sections even when their headings are in wrapper elements (T318044 T324869)]] (duration: 09m 32s) [14:27:41] T318044: Active section is determined incorrectly when real active section is inside a wrapper - https://phabricator.wikimedia.org/T318044 [14:27:41] T325819: pl.wiki: editcontentmodel for interface-admin - https://phabricator.wikimedia.org/T325819 [14:27:42] T324869: Wrong headings are bolded in the ToC - https://phabricator.wikimedia.org/T324869 [14:27:52] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875305 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [14:27:54] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875312 (https://phabricator.wikimedia.org/T321246) (owner: 10Urbanecm) [14:28:11] 
(03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [14:28:14] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:875305|Start reading from cul_actor on testwiki (T233004)]], [[gerrit:875312|aswikiquote: Set timezone to Asia/Kolkata (T321246)]] [14:28:20] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:28:24] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [14:29:02] (03CR) 10CI reject: [V: 04-1] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [14:30:04] !log urbanecm@deploy1002 urbanecm and urbanecm and zabe: Backport for [[gerrit:875305|Start reading from cul_actor on testwiki (T233004)]], [[gerrit:875312|aswikiquote: Set timezone to Asia/Kolkata (T321246)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [14:30:13] zabe: can you test your patch at mwdebug1001 please? [14:30:19] (03PS1) 10Gerrit maintenance bot: Add guc to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/874845 (https://phabricator.wikimedia.org/T321880) [14:30:26] yep [14:30:50] (03CR) 10FNegri: [C: 03+1] "Nice!" 
[puppet] - 10https://gerrit.wikimedia.org/r/868634 (https://phabricator.wikimedia.org/T323714) (owner: 10David Caro) [14:31:15] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:23] so exciting ^_^ [14:32:00] urbanecm, lgtm, Special:CheckUserLog yields the same result and logstash does not show something bad [14:32:12] awesome [14:32:12] syncing [14:32:21] !log fix inconsistent mtu on mr1-eqiad - T315838 [14:32:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:24] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:33:57] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:35:39] (03PS1) 10Majavah: openstack: encapi: improve git pushing [puppet] - 10https://gerrit.wikimedia.org/r/875317 (https://phabricator.wikimedia.org/T318504) [14:37:31] !log dbmaint deploy schema change on s5 eqiad T326223 [14:37:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:37:34] T326223: Add primary key and drop unique index on swsource_links on wmf wikis - https://phabricator.wikimedia.org/T326223 [14:38:05] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:875305|Start reading from cul_actor on testwiki (T233004)]], [[gerrit:875312|aswikiquote: Set timezone to Asia/Kolkata (T321246)]] (duration: 09m 50s) [14:38:10] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [14:38:10] T321246: Create Wikiquote Assamese - https://phabricator.wikimedia.org/T321246 [14:38:18] zabe: should be live [14:38:22] anything else, anyone? 
[14:38:22] !log dbmaint deploy schema change on s3 eqiad T326223 [14:38:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:06] thanks :) [14:39:49] no problem [14:39:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T326011)', diff saved to https://phabricator.wikimedia.org/P42782 and previous config saved to /var/cache/conftool/dbconfig/20230104-143949-marostegui.json [14:39:52] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:40:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:40:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:40:13] !log UTC afternoon B&C window done [14:40:19] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2094.codfw.wmnet with reason: Maintenance [14:40:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T326011)', diff saved to https://phabricator.wikimedia.org/P42783 and previous config saved to /var/cache/conftool/dbconfig/20230104-144025-marostegui.json [14:40:27] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [14:40:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:39] claime: i'm done with MW deployment for now, if you want to restart what you were doing earlier? [14:42:29] !log fix inconsistent mtu between cr1-eqiad<->lsw1-f1 - T315838 [14:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:34] T315838: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 [14:43:10] https://phabricator.wikimedia.org/T321246#8498633 :P [14:44:13] (03CR) 10David Caro: "When would a commit be empty? 
Should we monitor those/add extra logs for debugging?" [puppet] - 10https://gerrit.wikimedia.org/r/875317 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:44:40] Amir1: what's wrong about that? the second commit is relevant to that task :)) [14:44:49] !log dbmaint deploy schema change on s5 eqiad T326222 [14:44:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:53] T326222: Add primary key and drop unique index on swauthor_links on wmf wikis - https://phabricator.wikimedia.org/T326222 [14:45:01] aah [14:45:05] (03CR) 10David Caro: [C: 03+1] "Just the question LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875317 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:46:18] * urbanecm sometimes backports two commits at once to save time :D [14:46:38] (03CR) 10Majavah: openstack: encapi: improve git pushing (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875317 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:46:43] !log dbmaint deploy schema change on s3 eqiad T326222 [14:46:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:16] yeah, especially now that a single sync takes almost 10 minutes it makes sense to batch a few small commits into one [14:48:23] indeed [14:48:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T326011)', diff saved to https://phabricator.wikimedia.org/P42784 and previous config saved to /var/cache/conftool/dbconfig/20230104-144853-marostegui.json [14:48:58] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [14:51:12] (03CR) 10David Caro: [C: 03+2] openstack: encapi: improve git pushing [puppet] - 10https://gerrit.wikimedia.org/r/875317 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [14:55:01] (03PS6) 10Raymond Ndibe: tools-webservice: read buildservice_repository from webservice.yaml config file 
[software/tools-webservice] - 10https://gerrit.wikimedia.org/r/867910 (https://phabricator.wikimedia.org/T323689) [14:55:42] (03CR) 10Vlad.shapik: Use blubber via Docker tooling; no longer requires local binary (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 (owner: 10Brion VIBBER) [14:56:09] (03CR) 10Vlad.shapik: [C: 03+1] "I've checked. Looks good to me." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/865779 (owner: 10Brion VIBBER) [14:58:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [14:59:03] (03PS1) 10Ayounsi: test_mtu: ignore frack + report everything else as failure [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/875321 (https://phabricator.wikimedia.org/T315838) [15:00:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1001" [15:00:12] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cephosd1001.eqiad.wmnet with OS bullseye [15:01:09] (03CR) 10Ayounsi: [C: 03+2] "self merge as trivial" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/875321 (https://phabricator.wikimedia.org/T315838) (owner: 10Ayounsi) [15:01:56] (03Merged) 10jenkins-bot: test_mtu: ignore frack + report everything else as failure [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/875321 (https://phabricator.wikimedia.org/T315838) (owner: 10Ayounsi) [15:02:17] (03CR) 10MVernon: [C: 03+1] "Hi," [puppet] - 10https://gerrit.wikimedia.org/r/859592 (https://phabricator.wikimedia.org/T308677) (owner: 10Jbond) [15:03:56] (03CR) 10Muehlenhoff: [C: 03+2] puppetdb/bookworm: One more typo in the config [puppet] - 10https://gerrit.wikimedia.org/r/875301 (https://phabricator.wikimedia.org/T321783) (owner: 10Muehlenhoff) [15:04:01] 
!log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P42785 and previous config saved to /var/cache/conftool/dbconfig/20230104-150400-marostegui.json [15:05:54] !log dbmaint deploy schema change on s3 eqiad T326224 [15:05:56] !log dbmaint deploy schema change on s5 eqiad T326224 [15:05:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:01] T326224: Add primary key and drop unique index on revsrc on wmf wikis - https://phabricator.wikimedia.org/T326224 [15:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:06] (03PS14) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [15:09:07] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [15:12:46] (03PS2) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) [15:12:48] (03PS2) 10Hashar: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) [15:12:50] (03PS1) 10Hashar: systemd::unit: support multiple overrides [puppet] - 10https://gerrit.wikimedia.org/r/875347 (https://phabricator.wikimedia.org/T326125) [15:12:54] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10ayounsi) a:05cmooney→03ayounsi Last ones are the Fundraising Infrastructure related links (between cr, pfw and fasw). As most of them are not managed by Netbox, I ignore... 
[15:15:30] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:16:46] (03CR) 10CI reject: [V: 04-1] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:17:48] (03CR) 10Jbond: [C: 03+2] phabricator: Add type for task validation [puppet] - 10https://gerrit.wikimedia.org/r/875299 (owner: 10Jbond) [15:17:53] (03CR) 10Jbond: [C: 03+2] dns_lookup: import dnslookup to evaluate if usefull vendor upstream [puppet] - 10https://gerrit.wikimedia.org/r/875300 (owner: 10Jbond) [15:19:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P42786 and previous config saved to /var/cache/conftool/dbconfig/20230104-151907-marostegui.json [15:19:23] (03CR) 10Jbond: O:installserver::light: Update ACL to be based on roles (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [15:21:31] (03CR) 10Ayounsi: [C: 03+1] "Awesome!" 
[puppet] - 10https://gerrit.wikimedia.org/r/869224 (owner: 10Jbond) [15:21:51] (03PS4) 10Ladsgroup: Disable LoadMonitor in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) [15:22:08] jouncebot: nowandnext [15:22:08] No deployments scheduled for the next 2 hour(s) and 37 minute(s) [15:22:08] In 2 hour(s) and 37 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1800) [15:22:13] (03CR) 10Ladsgroup: [C: 03+2] Disable LoadMonitor in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) (owner: 10Ladsgroup) [15:23:03] (03Merged) 10jenkins-bot: Disable LoadMonitor in CLI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) (owner: 10Ladsgroup) [15:23:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874899 (https://phabricator.wikimedia.org/T322156) (owner: 10Ladsgroup) [15:23:31] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]] [15:23:45] T322156: New errors during this month's full dump run: LoadBalancer.php: No server with index '4' - https://phabricator.wikimedia.org/T322156 [15:25:12] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [15:26:29] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:51] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:26:59] (03PS3) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:27:01] (03PS1) 10Jbond: systemd: only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875350 [15:28:28] jbond: I had a parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/875347 ;) [15:28:48] using ensure_resource but apparently it is not doing what I am expecting :\ [15:28:52] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38966/console" [puppet] - 10https://gerrit.wikimedia.org/r/875350 (owner: 10Jbond) [15:29:54] jouncebot: nowandnext [15:29:54] No deployments scheduled for the next 2 hour(s) and 30 minute(s) [15:29:54] In 2 hour(s) and 30 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1800) [15:30:03] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [15:30:15] (03CR) 10Hashar: "I have a similar patch at https://gerrit.wikimedia.org/r/c/operations/puppet/+/875347 which also fixes the duplicate exec for systemd dae" [puppet] - 10https://gerrit.wikimedia.org/r/875350 (owner: 10Jbond) [15:32:04] !log Restarting rolling reboot of api_appserver hosts in codfw [15:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:32:10] !log cgoubert@cumin1001 START - Cookbook sre.hosts.reboot-cluster [15:32:17] jbond: I will try again tomorrow :] [15:32:53] (03CR) 10Ahmon Dancy: [C: 03+1] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [15:33:20] !log ladsgroup@deploy1002 Finished 
scap: Backport for [[gerrit:874899|Disable LoadMonitor in CLI (T322156)]] (duration: 09m 48s) [15:33:29] T322156: New errors during this month's full dump run: LoadBalancer.php: No server with index '4' - https://phabricator.wikimedia.org/T322156 [15:34:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T326011)', diff saved to https://phabricator.wikimedia.org/P42787 and previous config saved to /var/cache/conftool/dbconfig/20230104-153413-marostegui.json [15:34:16] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:34:17] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [15:34:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T326011)', diff saved to https://phabricator.wikimedia.org/P42788 and previous config saved to /var/cache/conftool/dbconfig/20230104-153435-marostegui.json [15:34:37] !log installing glibc security updates [15:34:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:42] !log installing glibc security updates on bullseye [15:34:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:13] * Krinkle testing on mwdebug1002 [15:41:06] I'm gonna have to stop my reboots again aren't I? 
[15:41:55] I thought the deployment was done and I was wrong [15:42:10] Gimme a minute before deploying so the current reboot finishes [15:43:09] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T326011)', diff saved to https://phabricator.wikimedia.org/P42789 and previous config saved to /var/cache/conftool/dbconfig/20230104-154308-marostegui.json [15:43:31] (03PS1) 10Muehlenhoff: Enable profile::auto_restarts::service for graphite/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875352 (https://phabricator.wikimedia.org/T135991) [15:45:21] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [15:46:54] (03PS1) 10Muehlenhoff: Add Cumin aliases for aux-k8s [puppet] - 10https://gerrit.wikimedia.org/r/875353 [15:48:23] (03CR) 10Filippo Giunchedi: [C: 03+1] Enable profile::auto_restarts::service for graphite/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875352 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:49:55] (03PS1) 10Muehlenhoff: Add Cumin aliases for analytics postgres hosts [puppet] - 10https://gerrit.wikimedia.org/r/875354 [15:50:17] (03CR) 10Muehlenhoff: [C: 03+2] Enable profile::auto_restarts::service for graphite/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875352 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [15:50:36] jouncebot nowandnext [15:50:37] No deployments scheduled for the next 2 hour(s) and 9 minute(s) [15:50:37] In 2 hour(s) and 9 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1800) [15:50:49] dancy: there's a backport running [15:51:28] claime: Thanks. Lemme know when I can proceed to do a scap update. 
[15:51:32] !log cgoubert@cumin1001 END (ERROR) - Cookbook sre.hosts.reboot-cluster (exit_code=97) [15:56:09] (03CR) 10Btullis: Add Cumin aliases for analytics postgres hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875354 (owner: 10Muehlenhoff) [15:57:59] (03PS1) 10Muehlenhoff: webperf/site: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875355 (https://phabricator.wikimedia.org/T135991) [15:58:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P42790 and previous config saved to /var/cache/conftool/dbconfig/20230104-155815-marostegui.json [15:58:22] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd: only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875350 (owner: 10Jbond) [15:58:58] (03CR) 10Muehlenhoff: Add Cumin aliases for analytics postgres hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875354 (owner: 10Muehlenhoff) [15:59:34] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,name=mw2400.* [15:59:41] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,name=mw2401.* [15:59:48] !log cgoubert@cumin1001 conftool action : set/pooled=yes; selector: cluster=api_appserver,name=mw2402.* [16:01:07] Amir1: Krinkle: Reboots stopped, go ahead [16:01:24] I'm not doing anything tbh [16:01:29] Unless Krinkle is doing something [16:03:49] lock cleared [16:03:50] go ahead [16:04:19] dancy: ^ [16:04:22] claime: no deployment, just staging on mwdebug [16:04:29] Oh ok [16:04:42] not sure I understand why that would interfere? is the reboot script checking scap locks? 
[16:05:30] No, but if you were doing an actual deployment scap would have logged failures for the api servers currently rebooting [16:08:36] ok [16:13:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P42791 and previous config saved to /var/cache/conftool/dbconfig/20230104-161321-marostegui.json [16:15:59] Krinkle: Just confirmed, okay for me to deploy a new version of scap now? It takes about a minute [16:16:06] *just confirming.. [16:17:34] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/875353 (owner: 10Muehlenhoff) [16:18:18] (03PS7) 10Clément Goubert: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 [16:19:57] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: allow rsyslog to process the apache logs [deployment-charts] - 10https://gerrit.wikimedia.org/r/864548 (https://phabricator.wikimedia.org/T265876) (owner: 10Giuseppe Lavagetto) [16:25:26] (03CR) 10Clément Goubert: [C: 03+2] mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [16:25:54] (03CR) 10JHathaway: [C: 03+1] "looks good to me, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/875353 (owner: 10Muehlenhoff) [16:26:28] dancy: yes [16:26:52] thx [16:27:16] !log dancy@deploy1002 Started scap: (no justification provided) [16:27:29] !log dancy@deploy1002 sync-world aborted: (no justification provided) (duration: 00m 13s) [16:27:32] dancy: to clarify, I lock deploy1002 (simply via touch /var/lock/scap-global-lock) whenever I am cherry picking something there to pull only on mwdebug1002 to test something ad-hoc, so as to prevent someone from accidentally deploying and/or deploying something else that wipes out my test. [16:27:48] Gotcha. 
[16:28:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T326011)', diff saved to https://phabricator.wikimedia.org/P42792 and previous config saved to /var/cache/conftool/dbconfig/20230104-162828-marostegui.json [16:28:38] (03PS3) 10Ryan Kemper: team-search-platform: relax kafka burrow check [alerts] - 10https://gerrit.wikimedia.org/r/868234 (owner: 10DCausse) [16:28:46] btw `scap lock --all ` is the canonical way to manipulate /var/lock/scap-global-lock [16:29:03] ack [16:29:11] (03CR) 10Giuseppe Lavagetto: "LGTM; I would change "relevant cluster" to "local cluster" in most comments as that is more properly what we're picking by choosing "$::si" [puppet] - 10https://gerrit.wikimedia.org/r/868721 (https://phabricator.wikimedia.org/T162123) (owner: 10MVernon) [16:29:43] !log dancy@deploy1002 Installing scap version "4.31.0" for 560 hosts [16:30:39] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [16:31:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:31:35] (03PS4) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [16:31:56] (03Merged) 10jenkins-bot: mediawiki: Add GeoIP data to chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/870660 (owner: 10Clément Goubert) [16:32:30] (03CR) 10Giuseppe Lavagetto: "My suggestion is to just add the new data structure to the common/ private hiera in one change, then remove it from the two site-specific " [labs/private] - 10https://gerrit.wikimedia.org/r/868718 (https://phabricator.wikimedia.org/T162123) 
(owner: 10MVernon) [16:32:31] Any known issues with parse1002.eqiad.wmnet ? It hasn't been reachable from the deploy server for a day or two [16:32:46] It's dead jim [16:32:58] It's got a CPU issue [16:33:05] https://phabricator.wikimedia.org/T326119 ? [16:33:07] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:33:10] It is marked inactive in confctl [16:33:13] !log marostegui@cumin1001 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 4:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:33:14] yes [16:33:53] Unfortunately it is still listed in the /etc/group/dsh files on the deploy server so scap still tries to interact with it. [16:34:00] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [16:34:17] ah [16:34:41] dancy: should I remove it via puppet until it's repaired? [16:34:51] Yes please [16:34:56] ack [16:35:08] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:35:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2105.codfw.wmnet with reason: Maintenance [16:35:20] !log dancy@deploy1002 Started scap: testing [16:35:56] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: git_pull_charts.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:36:07] um [16:37:13] ah, it's listed as a scap upgrade target (via puppetdb) and not a mediawiki one [16:37:37] claime: if you just remove it from there, then you have a risk of that node using a different scap version than everything else once it's back [16:38:45] taavi: That's going to happen as it stands because all other nodes have just had a newer scap installed. 
[16:40:11] yeah [16:40:25] I'll add to the task that it needs a scap upgrade before being put back in the pool [16:40:37] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:41:12] (03CR) 10Thcipriani: [C: 03+2] deploy_artifacts: add dry run mode [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868461 (owner: 10Hashar) [16:41:17] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:41:23] (03CR) 10Thcipriani: [C: 03+2] deploy_artifacts: --version is a required option [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868462 (owner: 10Hashar) [16:41:24] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:41:34] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:42:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:42:25] 10SRE, 10Infrastructure-Foundations, 10fundraising-tech-ops, 10netops: Set consistent MTUs - https://phabricator.wikimedia.org/T315838 (10Dwisehaupt) @ayounsi That window is perfect. I'll add it to our list for the week to make sure we don't forget it. 
[16:42:26] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1102.eqiad.wmnet with reason: Maintenance [16:42:53] (03Merged) 10jenkins-bot: deploy_artifacts: add dry run mode [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868461 (owner: 10Hashar) [16:42:56] (03Merged) 10jenkins-bot: deploy_artifacts: --version is a required option [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/868462 (owner: 10Hashar) [16:44:29] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:44:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1112.eqiad.wmnet with reason: Maintenance [16:44:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:44:58] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [16:45:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T326011)', diff saved to https://phabricator.wikimedia.org/P42793 and previous config saved to /var/cache/conftool/dbconfig/20230104-164504-marostegui.json [16:45:45] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [16:48:36] !log dancy@deploy1002 Finished scap: testing (duration: 13m 16s) [16:49:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T326011)', diff saved to https://phabricator.wikimedia.org/P42794 and previous config saved to /var/cache/conftool/dbconfig/20230104-164915-marostegui.json [16:49:37] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:49:55] !log dancy@deploy1002 Installing scap version "4.30.3-1" for 560 
hosts [16:50:29] (03PS1) 10Mforns: Bump up mediawiki_history_snapshot to 2022-12 [puppet] - 10https://gerrit.wikimedia.org/r/875364 [16:51:45] (03CR) 10Dzahn: "this will need testing of course, but the explanation and code I see make sense to me. lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:52:15] (03CR) 10Dzahn: [C: 03+1] gitlab: stop using "latest" backup name [puppet] - 10https://gerrit.wikimedia.org/r/875309 (https://phabricator.wikimedia.org/T274463) (owner: 10Jelto) [16:53:09] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2010.codfw.wmnet, kubernetes2007.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2022.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:53:41] (03PS1) 10Clément Goubert: Revert "mediawiki: Add GeoIP data to chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/875366 [16:53:47] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - mwdebug_4444: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2006.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [16:54:04] it's me [16:54:08] we good [16:54:48] (03CR) 10Clément Goubert: [C: 03+2] Revert "mediawiki: Add GeoIP data to chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/875366 (owner: 10Clément Goubert) [16:54:49] !log xcollazo@deploy1002 Started deploy [airflow-dags/platform_eng@84f5f50]: Bumping platform_eng airflow instance to latest [16:55:00] (03PS5) 10Jbond: httpd: add flag to wait for 
network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [16:55:02] (03PS1) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [16:55:06] !log xcollazo@deploy1002 Finished deploy [airflow-dags/platform_eng@84f5f50]: Bumping platform_eng airflow instance to latest (duration: 00m 17s) [16:55:09] (03CR) 10Ahmon Dancy: [C: 03+1] dsh: Remove parse1002 from parsoid dsh group [puppet] - 10https://gerrit.wikimedia.org/r/875360 (https://phabricator.wikimedia.org/T326119) (owner: 10Clément Goubert) [16:55:30] (03PS1) 10Jon Harald Søby: Add namespace to gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875386 (https://phabricator.wikimedia.org/T326253) [16:56:29] (03CR) 10Dzahn: [V: 03+1] gerrit: require interface::alias before httpd class (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [16:56:52] (03Abandoned) 10Dzahn: gerrit: require interface::alias before httpd class [puppet] - 10https://gerrit.wikimedia.org/r/874939 (https://phabricator.wikimedia.org/T326125) (owner: 10Dzahn) [16:57:16] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Sadads) I believe that has changed and this could be closed - @FRomeo_WMF -- I believe this is now managed by you right? 
[16:57:38] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [16:58:16] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] dsh: Remove parse1002 from parsoid dsh group [puppet] - 10https://gerrit.wikimedia.org/r/875360 (https://phabricator.wikimedia.org/T326119) (owner: 10Clément Goubert) [16:58:39] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [16:59:33] (03CR) 10Dzahn: [V: 03+1] phabricator: add systemd::tmpfile snippet for phd run dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [16:59:54] (03Merged) 10jenkins-bot: Revert "mediawiki: Add GeoIP data to chart" [deployment-charts] - 10https://gerrit.wikimedia.org/r/875366 (owner: 10Clément Goubert) [17:00:01] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:00:10] 10ops-eqiad, 10DC-Ops, 10serviceops, 10Patch-For-Review: hw troubleshooting: CPU1 machine check error on parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T326119 (10Clement_Goubert) >>! In T326119#8499305, @gerritbot wrote: > Change 875360 **merged** by Clément Goubert: > %%%[operations/puppet@pr... [17:00:21] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:01:40] (03CR) 10Dzahn: [V: 03+1] "Like, if this is a working solution then how can it be easier to create another one instead?" 
[puppet] - 10https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [17:03:19] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:04:19] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:04:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P42795 and previous config saved to /var/cache/conftool/dbconfig/20230104-170421-marostegui.json [17:07:01] (03PS1) 10Ottomata: [WIP] modules/mesh - Support mesh service proxy without exposing a Service for public_port [deployment-charts] - 10https://gerrit.wikimedia.org/r/875387 (https://phabricator.wikimedia.org/T326252) [17:08:04] (03CR) 10Dzahn: [C: 03+2] "approved by langcom at https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Wayuu" [dns] - 10https://gerrit.wikimedia.org/r/874845 (https://phabricator.wikimedia.org/T321880) (owner: 10Gerrit maintenance bot) [17:09:26] (03CR) 10Clément Goubert: service::catalog: Add aux-k8s-ingress (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/868101 (https://phabricator.wikimedia.org/T325178) (owner: 10Clément Goubert) [17:10:04] !log new Wikipedia (and other projects) language added: guc - https://en.wikipedia.org/wiki/Wayuu_language - https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Wayuu T321880 [17:10:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:15] T321880: Create Wikipedia Wayuu - https://phabricator.wikimedia.org/T321880 [17:14:48] (03CR) 10Dzahn: admin: create new group deployment-jenkins (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [17:15:03] (03CR) 10Dzahn: admin: create new group deployment-jenkins (031 comment) 
[puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [17:17:40] (03CR) 10Dzahn: "I don't think we need more than 8 but https://phabricator.wikimedia.org/T1 is a valid task. So I guess {1,8} or {1,9}." [puppet] - 10https://gerrit.wikimedia.org/r/875299 (owner: 10Jbond) [17:19:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112', diff saved to https://phabricator.wikimedia.org/P42796 and previous config saved to /var/cache/conftool/dbconfig/20230104-171928-marostegui.json [17:20:47] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) I can confirm glam@ says it's a GSuite account, still, when I look from the WMF prod mail server side. [17:21:12] 10SRE, 10Znuny, 10serviceops-collab: Convert glam@wikimedia.org OTRS into a Google Group - https://phabricator.wikimedia.org/T233843 (10Dzahn) a:05Astinson→03FRomeo_WMF [17:28:20] !log dancy@deploy1002 Started scap: testing [17:31:18] 10SRE, 10SRE-swift-storage, 10Analytics-Radar, 10Recommendation-API: Run swift-object-expirer as part of the swift cluster - https://phabricator.wikimedia.org/T229584 (10LSobanski) [17:31:54] (03CR) 10Andrew Bogott: [C: 03+2] "<3 to see a patch that's just deletes!" 
[puppet] - 10https://gerrit.wikimedia.org/r/875281 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [17:33:09] (03CR) 10Andrew Bogott: [C: 03+2] openstack: horizon: use the enc api to update git data [puppet] - 10https://gerrit.wikimedia.org/r/875283 (https://phabricator.wikimedia.org/T318504) (owner: 10Majavah) [17:34:19] (03PS1) 10JHathaway: create a vrts profile [labs/private] - 10https://gerrit.wikimedia.org/r/875393 [17:34:35] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1112 (T326011)', diff saved to https://phabricator.wikimedia.org/P42797 and previous config saved to /var/cache/conftool/dbconfig/20230104-173434-marostegui.json [17:34:36] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1123.eqiad.wmnet with reason: Maintenance [17:34:39] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [17:34:49] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1123.eqiad.wmnet with reason: Maintenance [17:34:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1123 (T326011)', diff saved to https://phabricator.wikimedia.org/P42798 and previous config saved to /var/cache/conftool/dbconfig/20230104-173455-marostegui.json [17:36:11] !log dancy@deploy1002 Finished scap: testing (duration: 07m 50s) [17:36:15] (03CR) 10JHathaway: [C: 03+2] create a vrts profile [labs/private] - 10https://gerrit.wikimedia.org/r/875393 (owner: 10JHathaway) [17:36:17] (03CR) 10JHathaway: [V: 03+2 C: 03+2] create a vrts profile [labs/private] - 10https://gerrit.wikimedia.org/r/875393 (owner: 10JHathaway) [17:36:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:37:21] !log dancy@deploy1002 Installing scap version "4.31.1" for 560 hosts [17:39:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T326011)', diff saved to https://phabricator.wikimedia.org/P42799 and previous config saved to /var/cache/conftool/dbconfig/20230104-173905-marostegui.json [17:41:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:44:00] (03PS1) 10Dzahn: add az.wikimedia.org for Azerbaijani Wikimedians User Group [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) [17:44:01] (03PS1) 10Btullis: Update the cephosd partman recipe again [puppet] - 10https://gerrit.wikimedia.org/r/875395 (https://phabricator.wikimedia.org/T324670) [17:45:19] (03PS2) 10Btullis: Update the cephosd partman recipe again [puppet] - 10https://gerrit.wikimedia.org/r/875395 (https://phabricator.wikimedia.org/T324670) [17:46:07] (03PS2) 10Dzahn: add az.wikimedia.org for Azerbaijani Wikimedians User Group [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) [17:50:14] (03CR) 10Dzahn: "I can try to contact via https://en.wikipedia.org/wiki/User_talk:FULBERT (that is the freshly renewed chair of affcom it looks)" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [17:53:14] (03PS1) 10David Caro: DONOTMERGE: Adding tests to puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/875398 [17:53:27] (03CR) 10Krinkle: [C: 04-1] "Holding back for now since testing was unsuccessful due to further breakage in core making it not yet work. 
Going to focus on https://gerr" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874419 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [17:54:12] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P42800 and previous config saved to /var/cache/conftool/dbconfig/20230104-175412-marostegui.json [17:54:35] (03CR) 10Btullis: [C: 03+2] Update the cephosd partman recipe again [puppet] - 10https://gerrit.wikimedia.org/r/875395 (https://phabricator.wikimedia.org/T324670) (owner: 10Btullis) [17:54:38] (03PS2) 10David Caro: DONOTMERGE: Adding tests to puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/875398 [17:56:03] (03PS1) 10Majavah: openstack: encapi: fix error checking [puppet] - 10https://gerrit.wikimedia.org/r/875399 [17:57:54] (03CR) 10Krinkle: [C: 03+1] webperf/site: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/875355 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [17:57:56] (03CR) 10CI reject: [V: 04-1] DONOTMERGE: Adding tests to puppet-enc [puppet] - 10https://gerrit.wikimedia.org/r/875398 (owner: 10David Caro) [17:58:43] (03CR) 10Andrew Bogott: [C: 03+2] openstack: encapi: fix error checking [puppet] - 10https://gerrit.wikimedia.org/r/875399 (owner: 10Majavah) [17:59:42] (03CR) 10Dzahn: "https://en.wikipedia.org/wiki/User_talk:FULBERT#affcom_re:_az.wikimedia.org" [dns] - 10https://gerrit.wikimedia.org/r/875394 (https://phabricator.wikimedia.org/T306015) (owner: 10Dzahn) [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1800) [18:00:22] !log btullis@cumin1001 START - Cookbook sre.hosts.reimage for host cephosd1002.eqiad.wmnet with OS bullseye [18:01:53] (03PS1) 10Jbond: phabricator: update pattern to support old tickets like T1 [puppet] - 10https://gerrit.wikimedia.org/r/875401 [18:02:48] (03CR) 10Jbond: 
[C: 03+2] phabricator: Add type for task validation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875299 (owner: 10Jbond) [18:03:01] (03CR) 10Dzahn: [C: 03+1] "like" [puppet] - 10https://gerrit.wikimedia.org/r/875401 (owner: 10Jbond) [18:03:46] (03CR) 10Jbond: [C: 03+2] phabricator: update pattern to support old tickets like T1 [puppet] - 10https://gerrit.wikimedia.org/r/875401 (owner: 10Jbond) [18:04:35] (03CR) 10Dzahn: [C: 03+2] "just like we did on a bunch of other hosts, deploying" [puppet] - 10https://gerrit.wikimedia.org/r/875355 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:05:28] triple merge [18:09:19] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123', diff saved to https://phabricator.wikimedia.org/P42801 and previous config saved to /var/cache/conftool/dbconfig/20230104-180918-marostegui.json [18:09:57] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling [18:12:20] (03CR) 10Jbond: [V: 03+1 C: 03+2] systemd: only create the override directory once (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/875350 (owner: 10Jbond) [18:13:52] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling (duration: 03m 54s) [18:14:09] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:14:20] andrewbogott: mutante: my change is very minor and can be merged with any [18:14:45] ACK, mine can also be merged [18:14:50] !log taavi@deploy1002 Started deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling (after remembering to update the submodules) [18:14:56] andrewbogott: holds the lock [18:15:36] have ping in cloud-admin, might have to kill it and see what andrew's change is if we don't hear from them [18:15:43] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. https://wikitech.wikimedia.org/wiki/Monitoring/unmerged_changes [18:15:44] !log taavi@deploy1002 Finished deploy [horizon/deploy@9d02cd6]: pushing wmf-puppet-dashboard updates for enc git handling (after remembering to update the submodules) (duration: 00m 54s) [18:15:46] andrew released the lock, I typed "multiple". fixed [18:15:55] ack thanks [18:15:55] jbond: fixed [18:16:32] it's weird, this is the first time I am seeing an alert for unmerged changes? [18:16:57] I mean it positively of course that it's a good alert but I am wondering why I didn't notice it before [18:17:06] the alert is not new, but will only alert after 5 min or so [18:17:07] were there simply unmerged changes or something else? [18:17:13] usually we don't have unmerged things for 5 min [18:17:19] interesting! it's probably that then [18:17:39] it was only because we had a 3-way merge [18:17:52] not the right term.. but more than 2 merges [18:18:02] yeah, I guess that's rare :) [18:18:52] the first user has to release the lock / end puppet-merge and the second user has to do it as well and type "multiple" instead of yes, to confirm. then it's fixed [18:18:57] laters [18:21:06] (03CR) 10Dzahn: [C: 03+2] "ran puppet on webperf1003,webperf2003. the expected systemd timers have been created.. 
"/Systemd::Service[wmf_auto_restart_envoyproxy]"" [puppet] - 10https://gerrit.wikimedia.org/r/875355 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [18:24:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1123 (T326011)', diff saved to https://phabricator.wikimedia.org/P42802 and previous config saved to /var/cache/conftool/dbconfig/20230104-182425-marostegui.json [18:24:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance [18:24:29] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [18:24:40] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1145.eqiad.wmnet with reason: Maintenance [18:26:41] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [18:26:44] (03PS4) 10Dzahn: admin: create new group deployment-jenkins [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) [18:26:54] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1166.eqiad.wmnet with reason: Maintenance [18:27:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T326011)', diff saved to https://phabricator.wikimedia.org/P42803 and previous config saved to /var/cache/conftool/dbconfig/20230104-182700-marostegui.json [18:27:15] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/output/869276/38970/deploy1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [18:28:17] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [18:29:23] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "releases1002: Notice: /Stage[main]/Admin/Admin::Hashgroup[deployment-jenkins]/Admin::Group[deployment-jenkins]/Group[deployment-jenkins]/e" [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [18:29:53] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [18:30:13] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "[releases1002:~] $ grep deployment-jenkins /etc/group" [puppet] - 10https://gerrit.wikimedia.org/r/869276 (https://phabricator.wikimedia.org/T324014) (owner: 10Dzahn) [18:31:08] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T326011)', diff saved to https://phabricator.wikimedia.org/P42804 and previous config saved to /var/cache/conftool/dbconfig/20230104-183108-marostegui.json [18:31:12] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [18:31:32] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) new admin group `deployment-jenkins` (gid: 838) has been created on deploy* and releases* servers... 
[18:32:04] 10SRE, 10SRE-Access-Requests, 10Continuous-Integration-Infrastructure, 10serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (10Dzahn) 05Open→03In progress [18:32:15] (03PS1) 10Jbond: systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 [18:33:35] (03CR) 10CI reject: [V: 04-1] systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 (owner: 10Jbond) [18:33:37] (03PS2) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [18:34:34] (03PS3) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [18:37:54] (03PS1) 10Majavah: secrets: ssh: remove instance-puppet-user key [labs/private] - 10https://gerrit.wikimedia.org/r/875407 [18:40:33] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@84f5f50]: (no justification provided) [18:40:38] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@84f5f50]: (no justification provided) (duration: 00m 05s) [18:43:19] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:21] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:43:47] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:44:50] (03PS2) 10Jbond: systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 [18:44:52] (03PS4) 
10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [18:44:54] (03PS6) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [18:46:15] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P42805 and previous config saved to /var/cache/conftool/dbconfig/20230104-184614-marostegui.json [18:47:15] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38971/console" [puppet] - 10https://gerrit.wikimedia.org/r/875406 (owner: 10Jbond) [18:47:56] (03CR) 10CI reject: [V: 04-1] systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 (owner: 10Jbond) [18:48:34] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [18:48:38] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [18:49:29] (03CR) 10Jbond: systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 (owner: 10Jbond) [18:51:24] (03Abandoned) 10Jbond: systemd: move exec to systemd [puppet] - 10https://gerrit.wikimedia.org/r/875406 (owner: 10Jbond) [18:55:36] (03PS5) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [18:55:46] (03PS7) 10Jbond: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [18:58:33] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 
10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [18:58:41] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [19:00:04] dduvall and hashar: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Train log triage with CPT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1900). [19:00:04] dduvall and hashar: #bothumor I � Unicode. All rise for MediaWiki train - Utc-7+Utc-0 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T1900). [19:01:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P42806 and previous config saved to /var/cache/conftool/dbconfig/20230104-190121-marostegui.json [19:07:32] !log dancy@deploy1002 Installing scap version "4.32.0" for 560 hosts [19:08:31] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) OK, I think I finally found the issue and also confirmed the fix on `traffic-cache-bullseye.traffic.eqiad1.wikimedia.cloud`. The TL;DR is that Debian bullseye supports cgroup v2 by default whereas... 
[19:16:28] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T326011)', diff saved to https://phabricator.wikimedia.org/P42807 and previous config saved to /var/cache/conftool/dbconfig/20230104-191627-marostegui.json [19:16:30] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:16:31] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [19:16:43] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1175.eqiad.wmnet with reason: Maintenance [19:16:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T326011)', diff saved to https://phabricator.wikimedia.org/P42808 and previous config saved to /var/cache/conftool/dbconfig/20230104-191648-marostegui.json [19:16:57] (03PS1) 10TrainBranchBot: group1 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875411 (https://phabricator.wikimedia.org/T325580) [19:16:59] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875411 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [19:18:03] (03Merged) 10jenkins-bot: group1 wikis to 1.40.0-wmf.17 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875411 (https://phabricator.wikimedia.org/T325580) (owner: 10TrainBranchBot) [19:20:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T326011)', diff saved to https://phabricator.wikimedia.org/P42809 and previous config saved to /var/cache/conftool/dbconfig/20230104-192057-marostegui.json [19:24:26] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) Metrics (default on bullseye, cgroup v2): ` container_cpu_system_seconds_total{id="/system.slice/varnish-frontend-fetcherr.service"} 0 1672859971947 
container_cpu_system_seconds_total{id="/system... [19:25:37] !log dduvall@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.40.0-wmf.17 refs T325580 [19:25:40] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [19:30:44] (03PS1) 10JHathaway: rename role [labs/private] - 10https://gerrit.wikimedia.org/r/875416 [19:31:00] (03CR) 10JHathaway: [C: 03+2] rename role [labs/private] - 10https://gerrit.wikimedia.org/r/875416 (owner: 10JHathaway) [19:31:07] (03CR) 10JHathaway: [V: 03+2 C: 03+2] rename role [labs/private] - 10https://gerrit.wikimedia.org/r/875416 (owner: 10JHathaway) [19:32:35] !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.40.0-wmf.17 refs T325580 (duration: 06m 58s) [19:33:05] T325580: 1.40.0-wmf.17 deployment blockers - https://phabricator.wikimedia.org/T325580 [19:34:11] 10SRE, 10ops-drmrs, 10Infrastructure-Foundations, 10netops: cr2-drmrs:xe-0/1/1 stuck optic - https://phabricator.wikimedia.org/T324555 (10RobH) Well, its pretty much the same on all SFP and XFP optics, looking at the one on my desk right now. The lever engages a very, very small metal plate that engages t... 
[19:36:05] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P42810 and previous config saved to /var/cache/conftool/dbconfig/20230104-193604-marostegui.json [19:44:22] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) https://salsa.debian.org/systemd-team/systemd/-/commit/170fb124a32884bd9975ee4ea9e1ffbbc2ee26b4 ` - -Ddefault-hierarchy=hybrid \ + -Ddefault-hierarchy=unified \ ` [19:47:32] (03CR) 10Jbond: "the following files would be deleted" [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [19:47:42] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10BCornwall) `systemd.unified_cgroup_hierarchy=0` enables systemd's "hybrid" mode, meaning that both v1 and v2 are enabled. Systemd makes its opinion very clear at https://systemd.io/CGROUP_DELEGATION/: >... [19:48:39] (03PS2) 10Andrew Bogott: Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) [19:48:41] (03PS1) 10Andrew Bogott: Openstack Designate eqiad1 -> version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/875422 (https://phabricator.wikimedia.org/T323086) [19:50:15] (03CR) 10CI reject: [V: 04-1] Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [19:50:47] (03CR) 10Jbond: [C: 04-1] "this will delete way too much, I'll rethink it tomorrow" [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [19:51:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P42811 and previous config saved to /var/cache/conftool/dbconfig/20230104-195110-marostegui.json [19:53:57] 10SRE, 10Traffic: Review cp2041 and cp2042 
running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) >>! In T325557#8499916, @BCornwall wrote: > `systemd.unified_cgroup_hierarchy=0` enables systemd's "hybrid" mode, meaning that both v1 and v2 are enabled. Systemd makes its opinion very clear at h... [19:55:30] (03PS6) 10Jbond: systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 [19:55:57] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Designate eqiad1 -> version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/875422 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [19:58:57] (03CR) 10CI reject: [V: 04-1] systemd: ensure we only create the override directory once [puppet] - 10https://gerrit.wikimedia.org/r/875365 (owner: 10Jbond) [20:04:59] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10BCornwall) Another data point: It seems that some metrics are lost when moving to v2: https://github.com/google/cadvisor/issues/3062 > On nodes running cgroup v1 the following metrics such as container_... 
[20:06:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T326011)', diff saved to https://phabricator.wikimedia.org/P42812 and previous config saved to /var/cache/conftool/dbconfig/20230104-200617-marostegui.json [20:06:19] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:06:21] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [20:06:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1179.eqiad.wmnet with reason: Maintenance [20:06:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T326011)', diff saved to https://phabricator.wikimedia.org/P42813 and previous config saved to /var/cache/conftool/dbconfig/20230104-200638-marostegui.json [20:10:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T326011)', diff saved to https://phabricator.wikimedia.org/P42814 and previous config saved to /var/cache/conftool/dbconfig/20230104-201047-marostegui.json [20:19:43] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:21:19] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:25:54] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P42815 and previous config saved to /var/cache/conftool/dbconfig/20230104-202554-marostegui.json [20:31:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:40:43] (03PS2) 10Hashar: systemd::unit: support multiple overrides [puppet] - 10https://gerrit.wikimedia.org/r/875347 (https://phabricator.wikimedia.org/T326125) [20:40:45] (03PS8) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) [20:41:01] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P42816 and previous config saved to /var/cache/conftool/dbconfig/20230104-204100-marostegui.json [20:41:21] (03CR) 10CI reject: [V: 04-1] systemd::unit: support multiple overrides [puppet] - 10https://gerrit.wikimedia.org/r/875347 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:42:05] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:49:02] (03PS3) 10Hashar: systemd::unit: support multiple overrides [puppet] - 10https://gerrit.wikimedia.org/r/875347 
(https://phabricator.wikimedia.org/T326125) [20:49:04] (03PS9) 10Hashar: httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) [20:49:19] (03PS3) 10Hashar: gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) [20:52:27] (03CR) 10CI reject: [V: 04-1] httpd: add flag to wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:52:51] (03CR) 10Hashar: "The ensure_resource" [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:53:21] (03CR) 10CI reject: [V: 04-1] gerrit: make Apache wait for network-online.target [puppet] - 10https://gerrit.wikimedia.org/r/875315 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:56:07] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T326011)', diff saved to https://phabricator.wikimedia.org/P42817 and previous config saved to /var/cache/conftool/dbconfig/20230104-205607-marostegui.json [20:56:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [20:56:11] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [20:56:12] (03CR) 10Hashar: "Various mediawiki profiles now fail with:" [puppet] - 10https://gerrit.wikimedia.org/r/875314 (https://phabricator.wikimedia.org/T326125) (owner: 10Hashar) [20:56:22] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1189.eqiad.wmnet with reason: Maintenance [20:56:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T326011)', diff saved to https://phabricator.wikimedia.org/P42818 and previous config saved to 
/var/cache/conftool/dbconfig/20230104-205628-marostegui.json [20:58:15] (03PS1) 10Gergő Tisza: Fix underlinkedness rescore logic [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875371 (https://phabricator.wikimedia.org/T301096) [20:58:23] !log running refreshGlobalimagelinks.php on all wikis (T322588) [20:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:26] T322588: Run `refreshGlobalimagelinks.php --pages=nonexisting` from the GlobalUsage extension - https://phabricator.wikimedia.org/T322588 [20:58:33] (03PS1) 10Gergő Tisza: Fix underlinkedness rescore logic [extensions/GrowthExperiments] (wmf/1.40.0-wmf.17) - 10https://gerrit.wikimedia.org/r/875372 (https://phabricator.wikimedia.org/T301096) [20:59:14] (03PS3) 10Andrew Bogott: Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) [20:59:16] (03PS1) 10Andrew Bogott: Openstack Designate eqiad1 -> version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/875435 (https://phabricator.wikimedia.org/T323086) [20:59:33] (03CR) 10Hashar: "This change is not directly related to T326146, I simply found out we had two resources defining the phd user and went to merge them in a " [puppet] - 10https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: 10Hashar) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: It is that lovely time of the day again! You are hereby commanded to deploy UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230104T2100). [21:00:05] zabe, Jhs, MatmaRex, and tgr: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[21:00:12] o/ [21:00:13] (03CR) 10CI reject: [V: 04-1] Nova: puppetize /etc/nova/api-paste.ini [puppet] - 10https://gerrit.wikimedia.org/r/874938 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [21:00:15] o/ I can deploy I guess, unless no-one else wants to [21:00:27] I can do it if you want taavi. [21:00:28] o/ [21:00:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T326011)', diff saved to https://phabricator.wikimedia.org/P42819 and previous config saved to /var/cache/conftool/dbconfig/20230104-210036-marostegui.json [21:00:38] kindrobot: sure, go ahead [21:01:05] (03CR) 10Hashar: [C: 04-1] phabricator: add systemd::tmpfile snippet for phd run dir (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/874943 (https://phabricator.wikimedia.org/T326146) (owner: 10Dzahn) [21:01:09] zabe we'll do yours first. [21:01:11] i guess my thing isn't needed any more, Amir1 is already doing it [21:01:25] (03CR) 10Andrew Bogott: [C: 03+2] Openstack Designate eqiad1 -> version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/875435 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [21:01:42] it shouldn't affect anything [21:05:03] !log starting UTC late backport window [21:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:18] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874957 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:05:52] (03PS2) 10Stef Dunlap: Start writing to cuc_comment_id on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874957 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:06:24] Needed to rebase. Give it a moment for CI. 
[21:07:14] you need to re'+2 since you rebased after TrainBranchBot gave its +2 [21:07:43] kindrobot: you can +2/`scap backport` before the normal tests finish, that usually saves some time especially when CI is as busy as it is right now [21:08:11] Ah, OK. I wasn't sure if I was supposed to wait for them to pass/fail. [21:08:44] Looks like it passed anyways. ;) [21:08:53] (03CR) 10TrainBranchBot: "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874957 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:10:27] (03Merged) 10jenkins-bot: Start writing to cuc_comment_id on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/874957 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [21:10:54] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:874957|Start writing to cuc_comment_id on group0 and group1 wikis (T233004)]] [21:10:57] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:12:40] !log kindrobot@deploy1002 kindrobot and zabe: Backport for [[gerrit:874957|Start writing to cuc_comment_id on group0 and group1 wikis (T233004)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [21:13:07] zabe, could you confirm the changes? [21:14:30] could you do a query for me? 'select * from cu_changes where cuc_user_text="Zabe" order by cuc_id desc limit 1\G' on metawiki? [21:14:41] Sure, one sec. 
[21:14:53] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 158 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:15:41] 10SRE, 10Thumbor, 10Thumbor Migration, 10serviceops, 10User-jijiki: Upgrade Thumbor to Bullseye - https://phabricator.wikimedia.org/T216815 (10VirginiaPoundstone) [21:15:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P42820 and previous config saved to /var/cache/conftool/dbconfig/20230104-211542-marostegui.json [21:16:29] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:18:23] zabe, https://phabricator.wikimedia.org/P42821 [21:19:15] kindrobot, sorry, but could you run the query on metawiki? It seems like you ran it on enwiki. [21:19:25] Sure [21:19:36] Oh, sorry. Missed that part. 
:o [21:20:21] no worries :) [21:20:50] https://phabricator.wikimedia.org/P42822 [21:21:06] zabe, i'm here now (fell asleep before) [21:21:40] kindrobot, lgtm [21:22:03] Thanks, continuing the sync [21:26:01] (CirrusSearchHighOldGCFrequency) firing: (2) Elasticsearch instance cloudelastic1006-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:27:23] oh, sorry, that was supposed to have been directed at kindrobot, i see [21:28:22] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:874957|Start writing to cuc_comment_id on group0 and group1 wikis (T233004)]] (duration: 17m 28s) [21:28:26] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:29:22] Thanks Jhs. Would you mind rebasing your patch while we're waiting? 
[21:29:24] (03PS1) 10Zabe: Start writing to cuc_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875438 (https://phabricator.wikimedia.org/T233004) [21:30:07] kindrobot, alright [21:30:11] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:30:21] (03PS2) 10Jon Harald Søby: Add namespace to gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875386 (https://phabricator.wikimedia.org/T326253) [21:30:50] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P42823 and previous config saved to /var/cache/conftool/dbconfig/20230104-213049-marostegui.json [21:31:32] (03PS1) 10Ahmon Dancy: wmfdebug 0.0.6: Include the wmf-certificates package [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/875439 [21:31:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875386 (https://phabricator.wikimedia.org/T326253) (owner: 10Jon Harald Søby) [21:32:19] 10SRE, 10Traffic: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) >>! In T325557#8499972, @BCornwall wrote: > Another data point: It seems that some metrics are lost when moving to v2: > > https://github.com/google/cadvisor/issues/3062 > >> On nodes running cg... 
[21:32:51] (03Merged) 10jenkins-bot: Add namespace to gorwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/875386 (https://phabricator.wikimedia.org/T326253) (owner: 10Jon Harald Søby) [21:33:15] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:875386|Add namespace to gorwiktionary (T326253)]] [21:33:18] T326253: Add Indeks namespace to gorwiktionary - https://phabricator.wikimedia.org/T326253 [21:35:01] !log kindrobot@deploy1002 kindrobot and jhsoby: Backport for [[gerrit:875386|Add namespace to gorwiktionary (T326253)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:35:25] Jhs: could you confirm? [21:35:49] looks right, but the script needs to be run as well to be sure everything is fine [21:36:29] It's safe to run the script before syncing to the rest of the servers? [21:37:01] hmm, not sure, sorry [21:37:45] it wouldn't do any harm if it's run before the change is synced, but it might also not have the desired effect if it isn't synced [21:37:47] kindrobot: you need to sync first, otherwise mwmaint1002 doesn't have the change [21:38:07] I see. Thanks taavi. :) [21:38:08] what taavi said ^^ [21:38:41] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:38:53] Ok, continuing sync [21:40:37] While that's going, for MatmaRex's do I need to run with any special considerations if it's a long-running script (i.e. run it in screen/tmux), taavi? Or does the job management happen outside the shell? [21:41:06] matmarex said earlier that Amir1 is already running it so you don't have to [21:41:20] yeah [21:41:21] I'm on it [21:41:53] but in general, long running scripts need to run in screen/tmux [21:42:08] Thanks. 
:) [21:42:16] there's also https://phabricator.wikimedia.org/project/view/2670/, which is the "proper" way to request scripts to be run instead of (ab)using backport windows [21:43:21] Ah, OK. [21:44:02] When ends up running those scripts. SRE? [21:44:09] *Who [21:44:42] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:875386|Add namespace to gorwiktionary (T326253)]] (duration: 11m 26s) [21:44:45] T326253: Add Indeks namespace to gorwiktionary - https://phabricator.wikimedia.org/T326253 [21:44:57] sadly that apparently isn't really defined [21:45:06] Running the script now. [21:45:07] whoever has access and sees the ticket, so in practice usually u.rbanecm or r.eedy [21:45:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T326011)', diff saved to https://phabricator.wikimedia.org/P42824 and previous config saved to /var/cache/conftool/dbconfig/20230104-214555-marostegui.json [21:45:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [21:45:59] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [21:46:11] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db1198.eqiad.wmnet with reason: Maintenance [21:46:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T326011)', diff saved to https://phabricator.wikimedia.org/P42825 and previous config saved to /var/cache/conftool/dbconfig/20230104-214616-marostegui.json [21:46:25] (03CR) 10Ahmon Dancy: [C: 03+1] "This is good to go." [puppet] - 10https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: 10Ahmon Dancy) [21:46:27] kindrobot, the script command i put is a dry run; if it looks fine, it can be re-run with --fix to actually do what it's supposed to do [21:46:45] OK, great. I was just about to ask you. [21:47:04] It said "Looks good!" 
so I'm going to run it with --fix [21:47:18] nice [21:48:21] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:48:22] Script has run, would you like the output? [21:48:25] !log mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki mediawikiwiki "African Wikimedia Technical Community/Project Scope" "Africa Wikimedia Technical Community/Project Scope" "Taavi" --reason "per request [[:phab:T318292]]" # T318292 [21:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:29] everything looks good now on the live wiki as far as i can tell [21:48:29] T318292: Move African Wikimedia Technical Community/Project Scope to Africa Wikimedia Technical Community/Project Scope - https://phabricator.wikimedia.org/T318292 [21:48:36] kindrobot, nah, not necessary [21:48:44] Ok, great. [21:49:03] tgr: you ready? [21:49:17] yes [21:49:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kindrobot@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - 10https://gerrit.wikimedia.org/r/875371 (https://phabricator.wikimedia.org/T301096) (owner: 10Gergő Tisza) [21:49:50] I can test in production, it's behind a feature flag. [21:50:14] kindrobot: I'd do both of those at the same time to save time, given that the CI will probably take a while [21:50:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T326011)', diff saved to https://phabricator.wikimedia.org/P42826 and previous config saved to /var/cache/conftool/dbconfig/20230104-215025-marostegui.json [21:51:05] If I already said "y" to "Backport the changes?" am I too late? 
[21:51:33] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 17 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:51:36] If it is waiting for something to merge, you can control-c and start over [21:51:39] no, you can ctrl+c until it merges [21:51:48] !log kindrobot@deploy1002 backport aborted: (duration: 02m 12s) [21:52:21] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875371 (https://phabricator.wikimedia.org/T301096) (owner: Gergő Tisza) [21:52:27] (CR) TrainBranchBot: [C: +2] "Approved by kindrobot@deploy1002 using scap backport" [extensions/GrowthExperiments] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875372 (https://phabricator.wikimedia.org/T301096) (owner: Gergő Tisza) [21:54:15] Thanks taavi and dancy, for this and all you help today.
It take a village (to raise a baby deployer) [22:05:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P42827 and previous config saved to /var/cache/conftool/dbconfig/20230104-220532-marostegui.json [22:06:35] (Merged) jenkins-bot: Fix underlinkedness rescore logic [extensions/GrowthExperiments] (wmf/1.40.0-wmf.14) - https://gerrit.wikimedia.org/r/875371 (https://phabricator.wikimedia.org/T301096) (owner: Gergő Tisza) [22:11:33] (Merged) jenkins-bot: Fix underlinkedness rescore logic [extensions/GrowthExperiments] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875372 (https://phabricator.wikimedia.org/T301096) (owner: Gergő Tisza) [22:11:58] !log kindrobot@deploy1002 Started scap: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]] [22:12:02] T301096: Add a link: prioritize suggestions of underlinked articles - https://phabricator.wikimedia.org/T301096 [22:13:48] !log kindrobot@deploy1002 kindrobot and tgr: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [22:14:37] tgr, I know you said you can test it in production, but would you mind confirming everything is OK/not-broken around the feature on debug before I sync it? [22:15:57] I can but it will take a few minutes [22:17:28] No rush, and doesn't have to be exhaustive. (You can do more testing in production.) I just want to do a little due diligence before shipping it.
[22:20:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P42828 and previous config saved to /var/cache/conftool/dbconfig/20230104-222038-marostegui.json [22:21:05] kindrobot: confirmed [22:21:27] Thanks! Syncing now. [22:27:18] !log kindrobot@deploy1002 Finished scap: Backport for [[gerrit:875371|Fix underlinkedness rescore logic (T301096)]], [[gerrit:875372|Fix underlinkedness rescore logic (T301096)]] (duration: 15m 20s) [22:27:22] T301096: Add a link: prioritize suggestions of underlinked articles - https://phabricator.wikimedia.org/T301096 [22:27:43] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:27:45] !log finished UTC late backport window [22:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:27:52] Thanks everyone! o/ [22:28:57] thank you! 
[22:31:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:35:45] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T326011)', diff saved to https://phabricator.wikimedia.org/P42831 and previous config saved to /var/cache/conftool/dbconfig/20230104-223545-marostegui.json [22:35:47] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:35:49] T326011: Add default values to cul_user and cul_user_text - https://phabricator.wikimedia.org/T326011 [22:35:50] (CR) Dzahn: [C: +2] train-presync: Pass -Dfull_image_build:True to scap stage-train [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy) [22:36:00] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:42:52] (PS5) Dzahn: deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) [22:43:24] jhathaway: you got a pending change on puppetmaster [22:43:44] mutante: thanks...
[22:44:08] mutante: merged [22:45:38] jhathaway: ack, thanks [22:51:55] (CR) Dzahn: [C: +2] "the systemd unit command line has been changed on deploy1002" [puppet] - https://gerrit.wikimedia.org/r/869333 (https://phabricator.wikimedia.org/T325576) (owner: Ahmon Dancy) [22:52:11] (CR) Dzahn: [C: +2] deployment_server: add keyholder/group config for jenkins-ci deploy [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [22:56:33] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [22:57:38] (CR) Dzahn: [C: +2] "worked. file /etc/keyholder.d/deploy_jenkins.pub" [puppet] - https://gerrit.wikimedia.org/r/868736 (https://phabricator.wikimedia.org/T324014) (owner: Dzahn) [22:58:00] yea, the unarmed keyholder alert makes sense because I just added a new key [23:00:03] !log deploy1002 - re-arming keyholder T324014 [23:00:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:07] T324014: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 [23:00:33] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab, and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) ` Identity added: /etc/keyholder.d/deploy_jenkins (/etc/keyholder.d/deploy_jenkins) ` [23:01:09] !log deploy2002 - re-arming keyholder T324014 [23:01:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:01:33] (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [23:03:14] SRE, SRE-Access-Requests, Continuous-Integration-Infrastructure, serviceops-collab,
and 2 others: New Keyholder identity for RelEng Jenkins service - https://phabricator.wikimedia.org/T324014 (Dzahn) In progress→Resolved @jnuche This is done now. on both deployment server keyholder has be... [23:04:44] (CR) Dzahn: "there is one change in the compiler output I did not anticipate. owner is changed from "phd" to "920", the numeric ID." [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:14:23] (PS1) Zabe: actions: Pass CommentFormatter to McrRestoreAction [core] (wmf/1.40.0-wmf.17) - https://gerrit.wikimedia.org/r/875379 (https://phabricator.wikimedia.org/T326275) [23:16:11] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:17:45] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 9 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:23:53] (PS3) Dzahn: phabricator: dedupe phd user creation [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:23:55] (PS1) Dzahn: admin: add data type for UIDs [puppet] - https://gerrit.wikimedia.org/r/875446 [23:36:02] (PS4) Dzahn: phabricator: dedupe phd user creation [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:36:41] (CR) CI reject: [V: -1] phabricator: dedupe phd user creation [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:38:07] (PS5) Dzahn: phabricator: dedupe phd user creation
[puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:45:03] (CR) Dzahn: "I made some small-ish changes so that the home dir stays owned by "phd" and not "920" as before and moving home_dir, UID, GID to actual cl" [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:46:36] (CR) Dzahn: [V: +1 C: +2] "compiler diff is now more "noop"-ish than before" [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:48:42] (CR) Dzahn: [V: +1 C: +2] "Error: Found 1 dependency cycle:" [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:57:08] (PS1) Dzahn: phabricator: solve dependency cycle for sysyser phd [puppet] - https://gerrit.wikimedia.org/r/875449 (https://phabricator.wikimedia.org/T326146) [23:57:19] (CR) Dzahn: [V: +1 C: +2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/875449/" [puppet] - https://gerrit.wikimedia.org/r/875265 (https://phabricator.wikimedia.org/T326146) (owner: Hashar) [23:58:12] (CR) Dzahn: [C: +2] phabricator: solve dependency cycle for sysyser phd [puppet] - https://gerrit.wikimedia.org/r/875449 (https://phabricator.wikimedia.org/T326146) (owner: Dzahn)