[00:01:11] (PS2) RLazarus: deployment_server: Add --file to mwscript-k8s [puppet] - https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230)
[00:01:16] (CR) RLazarus: deployment_server: Add --file to mwscript-k8s (3 comments) [puppet] - https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230) (owner: RLazarus)
[00:03:52] (PS3) RLazarus: mediawiki: Support copying text files into mw-script containers [deployment-charts] - https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230)
[00:04:25] (CR) RLazarus: mediawiki: Support copying text files into mw-script containers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: RLazarus)
[00:17:36] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:28:59] (CR) Scott French: [C:+1] mediawiki: Support copying text files into mw-script containers [deployment-charts] - https://gerrit.wikimedia.org/r/1085506 (https://phabricator.wikimedia.org/T376230) (owner: RLazarus)
[00:38:45] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085674
[00:38:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085674 (owner: TrainBranchBot)
[00:46:46] (CR) Scott French: [C:+1] deployment_server: Add --file to mwscript-k8s (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1085507 (https://phabricator.wikimedia.org/T376230) (owner: RLazarus)
[01:08:45] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085687
[01:08:45] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085687 (owner: TrainBranchBot)
[01:09:50] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1085674 (owner: TrainBranchBot)
[01:41:51] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1085687 (owner: TrainBranchBot)
[01:49:40] I'm getting DB timeouts when trying to look at user creation logs from 2014 on enwiki. Is that meant to timeout?
[01:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[02:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[04:17:36] FIRING: [2x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:39:17] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[05:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[06:39:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:07:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:51:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 200181288 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:52:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 5446888 and 0 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[10:50:36] SRE-OnFire, Incident Tooling: Corto: ensure Phabricator tasks are created with correct default visibility & priority - https://phabricator.wikimedia.org/T376500#10286143 (Aklapper) @corto could set view policy and edit policy to #acl_sre-team (`PHID-PROJ-fqb3bhereyljcqrsbju7`) when calling https://phabri...
[11:31:10] (PS1) Ladsgroup: Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386)
[12:07:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:12:05] PROBLEM - Ensure traffic_manager is running for instance backend on cp1103 is CRITICAL: PROCS CRITICAL: 2 processes with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[12:13:05] RECOVERY - Ensure traffic_manager is running for instance backend on cp1103 is OK: PROCS OK: 1 process with args /usr/bin/traffic_manager --nosyslog https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server
[13:33:09] PROBLEM - BGP status on cr2-magru is CRITICAL: BGP CRITICAL - No response from remote host 195.200.68.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[13:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[14:26:27] PROBLEM - Host ganeti2042 is DOWN: PING CRITICAL - Packet loss = 100%
[14:31:19] RECOVERY - Host ganeti2042 is UP: PING OK - Packet loss = 0%, RTA = 30.38 ms
[14:37:36] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:00:49] (PS3) Reedy: InitialiseSettings.php: Fix comment about $wgCrossSiteAJAXdomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1077467
[15:00:52] (CR) Reedy: [C:+2] InitialiseSettings.php: Fix comment about $wgCrossSiteAJAXdomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1077467 (owner: Reedy)
[15:01:15] (PS6) Reedy: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280
[15:01:26] (CR) Reedy: [C:+2] MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280 (owner: Reedy)
[15:01:34] (Merged) jenkins-bot: InitialiseSettings.php: Fix comment about $wgCrossSiteAJAXdomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1077467 (owner: Reedy)
[15:02:13] (Merged) jenkins-bot: MetaContactPages: Minor comment tweaks [mediawiki-config] - https://gerrit.wikimedia.org/r/1075280 (owner: Reedy)
[15:02:36] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:12:25] (PS2) Lucas Werkmeister (WMDE): Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386) (owner: Ladsgroup)
[15:13:15] !log reedy@deploy2002 Synchronized wmf-config/: Comment updates (duration: 07m 31s)
[15:13:30] (CR) Lucas Werkmeister (WMDE): [C:+1] "I updated the commit message to include more information (with help from Nikki); with that, this looks okay to deploy to me." [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386) (owner: Ladsgroup)
[15:16:18] (PS6) Reedy: Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107
[15:16:26] (CR) Reedy: [C:+2] Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107 (owner: Reedy)
[15:17:17] (Merged) jenkins-bot: Use more use statements rather than inline FQN [mediawiki-config] - https://gerrit.wikimedia.org/r/1068107 (owner: Reedy)
[15:18:35] (CR) Ladsgroup: [C:+1] Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386) (owner: Ladsgroup)
[15:19:33] !log reedy@deploy2002 Started scap sync-world: use statemnts
[15:19:33] I’ll deploy ^ soon if nobody objects
[15:19:41] (once Reedy is done ^^)
[15:26:46] !log reedy@deploy2002 Finished scap sync-world: use statemnts (duration: 07m 13s)
[15:29:04] Reedy: okay for me to deploy?
[15:34:59] (approval received from Amir1 in person, FTR :P)
[15:35:04] (CR) TrainBranchBot: [C:+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386) (owner: Ladsgroup)
[15:35:15] test plan: e.g. https://www.wikidata.org/?uselang=ig should become unbroken
[15:35:23] (and others linked in https://phabricator.wikimedia.org/T184386#9410079)
[15:35:32] you can still yell at me to stop deploying if needed
[15:36:03] (Merged) jenkins-bot: Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/1085922 (https://phabricator.wikimedia.org/T184386) (owner: Ladsgroup)
[15:36:35] !log lucaswerkmeister-wmde@deploy2002 Started scap sync-world: Backport for [[gerrit:1085922|Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata (T184386)]]
[15:38:59] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ladsgroup: Backport for [[gerrit:1085922|Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata (T184386)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[15:41:51] looks good to me on mwdebug
[15:42:03] the sidebar link still points to the wrong URL in some cases, that’s probably caching
[15:42:10] but at least it shows the correct main page now, that’s already an improvement
[15:42:20] I shall go ahead and deploy this if nobody stops me in a moment
[15:44:03] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde, ladsgroup: Continuing with sync
[15:48:45] !log lucaswerkmeister-wmde@deploy2002 Finished scap sync-world: Backport for [[gerrit:1085922|Remove 'mainpage' from $wgForceUIMsgAsContentMsg for Wikidata (T184386)]] (duration: 12m 09s)
[15:53:30] * Lucas_WMDE done deploying
[16:07:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[17:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[18:32:36] FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:33:57] FIRING: [5x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:48:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:51:49] FIRING: PuppetDisabled: Puppet disabled on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=misc&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled
[21:52:37] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:37:36] FIRING: [4x] ProbeDown: Service aqs1013-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:48:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:52:37] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state