[00:00:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P25204 and previous config saved to /var/cache/conftool/dbconfig/20220419-000057-ladsgroup.json [00:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:03:58] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01055 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [00:05:09] ^ bug [00:05:25] should be RESOLVED because contint1001 puppet run was just fixed [00:05:57] "no resources reported" is more like a good thing in this case? [00:06:30] and 0.01055 ge 0.01 .. yea :p [00:08:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25205 and previous config saved to /var/cache/conftool/dbconfig/20220419-000805-ladsgroup.json [00:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:08:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:09:41] (PS8) Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - https://gerrit.wikimedia.org/r/781009 [00:09:43] (CR) Ebernhardson: elastic: Restart masters one at a time after all others (1 comment) [software/spicerack] - https://gerrit.wikimedia.org/r/781009 (owner: Ebernhardson) [00:16:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T298565)', diff saved to https://phabricator.wikimedia.org/P25206 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-001602-ladsgroup.json [00:16:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [00:16:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [00:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:16:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P25207 and previous config saved to /var/cache/conftool/dbconfig/20220419-001610-ladsgroup.json [00:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:16:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P25208 and previous config saved to /var/cache/conftool/dbconfig/20220419-002126-ladsgroup.json [00:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:21:31] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:23:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to 
https://phabricator.wikimedia.org/P25209 and previous config saved to /var/cache/conftool/dbconfig/20220419-002310-ladsgroup.json [00:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:36:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P25210 and previous config saved to /var/cache/conftool/dbconfig/20220419-003631-ladsgroup.json [00:36:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [00:38:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P25211 and previous config saved to /var/cache/conftool/dbconfig/20220419-003815-ladsgroup.json [00:38:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:51:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P25212 and previous config saved to /var/cache/conftool/dbconfig/20220419-005136-ladsgroup.json [00:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25213 and previous config saved to /var/cache/conftool/dbconfig/20220419-005320-ladsgroup.json [00:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [00:53:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:53:30] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [00:53:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:53:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25214 and previous config saved to /var/cache/conftool/dbconfig/20220419-005334-ladsgroup.json [00:53:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:56:23] RECOVERY - MariaDB Replica Lag: s6 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [01:00:04] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T0100) [01:02:16] !log turning on general logging in pc1012 (pc2) (T285993) [01:02:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:02:21] T285993: [SPIKE] Estimate growth in demand for Parser Cache storage - https://phabricator.wikimedia.org/T285993 [01:03:39] !log turning off general logging in pc1012 (pc2) (T285993) [01:03:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25215 and previous config saved to /var/cache/conftool/dbconfig/20220419-010438-ladsgroup.json [01:04:42] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:04:42] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:06:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T298565)', diff saved to https://phabricator.wikimedia.org/P25216 and previous config saved to /var/cache/conftool/dbconfig/20220419-010641-ladsgroup.json [01:06:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [01:06:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [01:06:47] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:06:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:50] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [01:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:06:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T298565)', diff saved to https://phabricator.wikimedia.org/P25217 and previous config saved to /var/cache/conftool/dbconfig/20220419-010654-ladsgroup.json [01:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[01:11:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298565)', diff saved to https://phabricator.wikimedia.org/P25218 and previous config saved to /var/cache/conftool/dbconfig/20220419-011112-ladsgroup.json [01:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:11:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:17:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:18:18] (ProbeDown) firing: (8) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:18:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:19:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P25219 and previous config saved to /var/cache/conftool/dbconfig/20220419-011943-ladsgroup.json [01:19:46] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:19:51] PROBLEM - Apache HTTP on mw1350 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [01:20:17] PROBLEM - High average POST latency for mw requests on appserver in eqiad on alert1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [01:22:40] RECOVERY - High average POST latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [01:22:57] I'm around, making a lot of queries to pc2 [01:23:18] (ProbeDown) resolved: (22) Service appservers-https:443 has failed probes (http_appservers-https_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=http - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:24:30] RECOVERY - High average GET latency for mw requests on appserver in eqiad on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [01:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P25220 and previous config saved to /var/cache/conftool/dbconfig/20220419-012617-ladsgroup.json [01:26:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:26:55] (LogstashIngestSpike) firing: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [01:27:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [01:31:55] (LogstashIngestSpike) resolved: Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [01:34:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P25221 and previous config saved to /var/cache/conftool/dbconfig/20220419-013448-ladsgroup.json [01:34:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:36:12] PROBLEM - Check systemd state on doc1001 is CRITICAL: CRITICAL - degraded: The following units failed: 
rsync-doc-doc1002.eqiad.wmnet.service,rsync-doc-doc2001.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:39:30] RECOVERY - Apache HTTP on mw1350 is OK: HTTP OK: HTTP/1.1 302 Found - 634 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Application_servers [01:41:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P25222 and previous config saved to /var/cache/conftool/dbconfig/20220419-014122-ladsgroup.json [01:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:47:53] !log [doc1001:~] $ sudo systemctl start rsync-doc-doc1002.eqiad.wmnet [01:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [01:47:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:48:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:49:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25223 and previous config saved to /var/cache/conftool/dbconfig/20220419-014953-ladsgroup.json [01:49:55] !log ladsgroup@cumin1001 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:49:56] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [01:49:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:49:57] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:49:58] PROBLEM - Varnish traffic drop between 30min ago and now at esams on alert1001 is CRITICAL: 29.47 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:50:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:50:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:54:20] RECOVERY - Varnish traffic drop between 30min ago and now at esams on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/d/000000180/varnish-http-requests?orgId=1&viewPanel=6 [01:56:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T298565)', diff saved to https://phabricator.wikimedia.org/P25224 and previous config saved to /var/cache/conftool/dbconfig/20220419-015627-ladsgroup.json [01:56:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [01:56:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [01:56:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [01:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P25225 and previous config saved to /var/cache/conftool/dbconfig/20220419-015635-ladsgroup.json [01:56:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:56:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:57:30] PROBLEM - restbase endpoints health on restbase-dev1005 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [01:58:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime 
for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:58:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [01:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:59:44] RECOVERY - restbase endpoints health on restbase-dev1005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [02:06:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [02:06:59] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [02:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25226 and previous config saved to /var/cache/conftool/dbconfig/20220419-020703-ladsgroup.json [02:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:07:17] (PS1) TrainBranchBot: Branch commit for wmf/1.39.0-wmf.8 [core] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/783887 [02:07:19] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.39.0-wmf.8 [core] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/783887 (owner: TrainBranchBot) [02:07:49] !log mwdebug-deploy@deploy1002 
helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:52] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:07:53] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:07:57] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25227 and previous config saved to /var/cache/conftool/dbconfig/20220419-021901-ladsgroup.json [02:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:19:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [02:21:28] RECOVERY - Check systemd state on doc1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:22:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [02:23:47] (Merged) jenkins-bot: Branch commit for wmf/1.39.0-wmf.8 [core] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/783887 (owner: TrainBranchBot) [02:27:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP 
CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:28:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [02:28:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [02:28:28] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [02:28:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:32] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [02:28:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:31:32] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [02:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [02:34:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P25228 and previous config saved to /var/cache/conftool/dbconfig/20220419-023406-ladsgroup.json [02:34:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:36:14] RECOVERY - MariaDB Replica Lag: s3 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [02:49:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to 
https://phabricator.wikimedia.org/P25229 and previous config saved to /var/cache/conftool/dbconfig/20220419-024911-ladsgroup.json [02:49:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P25230 and previous config saved to /var/cache/conftool/dbconfig/20220419-025649-ladsgroup.json [02:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:56:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [03:04:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25231 and previous config saved to /var/cache/conftool/dbconfig/20220419-030416-ladsgroup.json [03:04:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [03:04:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [03:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:04:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25232 and previous config saved to /var/cache/conftool/dbconfig/20220419-030424-ladsgroup.json [03:04:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:11:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P25233 and previous config saved to /var/cache/conftool/dbconfig/20220419-031154-ladsgroup.json [03:11:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25234 and previous config saved to /var/cache/conftool/dbconfig/20220419-031501-ladsgroup.json [03:15:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:15:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:15:44] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [03:17:13] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [03:17:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance 
[03:17:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [03:17:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [03:17:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:27:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P25235 and previous config saved to /var/cache/conftool/dbconfig/20220419-032659-ladsgroup.json [03:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:28:14] (CR) KartikMistry: TTMServerAid::getData: Do not swallow TranslationHelperException (1 comment) [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/780641 (https://phabricator.wikimedia.org/T306233) (owner: KartikMistry) [03:30:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P25236 and previous config saved to /var/cache/conftool/dbconfig/20220419-033006-ladsgroup.json [03:30:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:32:42] RECOVERY - MariaDB Replica Lag: s5 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [03:42:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T298565)', diff saved to https://phabricator.wikimedia.org/P25237 and previous config saved to /var/cache/conftool/dbconfig/20220419-034204-ladsgroup.json [03:42:08] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:42:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [03:45:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P25238 and previous config saved to /var/cache/conftool/dbconfig/20220419-034512-ladsgroup.json [03:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:47:20] RECOVERY - MariaDB Replica Lag: s7 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:00:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25239 and previous config saved to /var/cache/conftool/dbconfig/20220419-040017-ladsgroup.json [04:00:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:00:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [04:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:22] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298565)', 
diff saved to https://phabricator.wikimedia.org/P25240 and previous config saved to /var/cache/conftool/dbconfig/20220419-040024-ladsgroup.json [04:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:00:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298565)', diff saved to https://phabricator.wikimedia.org/P25241 and previous config saved to /var/cache/conftool/dbconfig/20220419-041120-ladsgroup.json [04:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:11:25] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:14:45] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:14:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [04:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:14:48] (CR) Abijeet Patro: [C: +1] "recheck" [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/780641 (https://phabricator.wikimedia.org/T306233) (owner: KartikMistry) [04:14:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:15:34] RECOVERY - MariaDB Replica Lag: s2 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [04:26:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P25242 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-042625-ladsgroup.json [04:26:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [04:41:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P25243 and previous config saved to /var/cache/conftool/dbconfig/20220419-044130-ladsgroup.json [04:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298565)', diff saved to https://phabricator.wikimedia.org/P25244 and previous config saved to /var/cache/conftool/dbconfig/20220419-045635-ladsgroup.json [04:56:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:56:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [04:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:56:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:45] !log root@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 25 hosts with reason: Primary switchover s7 T306001 [04:57:48] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:57:48] T306001: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T306001 [04:58:01] !log root@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 25 hosts with reason: Primary switchover s7 T306001 [04:58:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:58:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set db1181 with weight 0 T306001', diff saved to https://phabricator.wikimedia.org/P25245 and previous config saved to /var/cache/conftool/dbconfig/20220419-045814-root.json [04:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:03:56] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [05:03:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [05:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:05:18] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [05:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25246 and previous config saved to /var/cache/conftool/dbconfig/20220419-050523-ladsgroup.json [05:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:05:27] T298565: Fix mismatching field type of 
user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:08:02] (PS2) Marostegui: mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/783819 (https://phabricator.wikimedia.org/T306001) [05:09:16] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [05:09:45] !log dbmaint s3@eqiad T306269 [05:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:09:50] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [05:15:22] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:16:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25247 and previous config saved to /var/cache/conftool/dbconfig/20220419-051608-ladsgroup.json [05:16:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:16:13] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [05:16:17] (CR) Marostegui: [C: +2] mariadb: Promote db1181 to s7 master [puppet] - https://gerrit.wikimedia.org/r/783819 (https://phabricator.wikimedia.org/T306001) (owner: Marostegui) [05:17:36] 
RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:29:23] (PS1) Marostegui: db1136: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/784078 (https://phabricator.wikimedia.org/T302363) [05:29:59] (CR) Marostegui: [C: +2] db1136: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/784078 (https://phabricator.wikimedia.org/T302363) (owner: Marostegui) [05:31:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P25248 and previous config saved to /var/cache/conftool/dbconfig/20220419-053113-ladsgroup.json [05:31:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:46:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P25249 and previous config saved to /var/cache/conftool/dbconfig/20220419-054618-ladsgroup.json [05:46:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [05:56:02] I stopped templatelinks backfill in s7 [05:56:58] Amir1: excellent [05:56:59] (PS1) Abijeet Patro: ElasticSearchTTMServer: tie break on wiki+localid [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/783907 (https://phabricator.wikimedia.org/T305428) [05:57:22] Amir1: we'll have another dbctl change in 15 minutes, which might collide with the 
switchover [05:57:29] so I will wait till 06:01 or so to start [05:57:34] (PS2) Abijeet Patro: ElasticSearchTTMServer: tie break on wiki+localid [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/783907 (https://phabricator.wikimedia.org/T305428) [05:57:39] the db1144:3311 repooling [05:57:42] sure sgtm [06:00:04] kormat, marostegui, and Amir1: How many deployers does it take to do Primary database switchover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T0600). [06:00:10] o/ [06:00:24] Waiting for a minute until the dbctl coming from the db1144:3314 repooling has happened [06:00:32] otherwise it will collide with the switchover dbctl [06:00:57] o/ [06:01:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25250 and previous config saved to /var/cache/conftool/dbconfig/20220419-060123-ladsgroup.json [06:01:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:01:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [06:01:27] * Amir1 waves at kormat [06:01:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:01:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P25251 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-060131-ladsgroup.json [06:01:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:42] ok, starting [06:01:43] !log Starting s7 eqiad failover from db1136 to db1181 - T306001 [06:01:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:47] T306001: Switchover s7 master (db1136 -> db1181) - https://phabricator.wikimedia.org/T306001 [06:01:58] !log marostegui@cumin1001 dbctl commit (dc=all): 'Set s7 eqiad as read-only for maintenance - T306001', diff saved to https://phabricator.wikimedia.org/P25252 and previous config saved to /var/cache/conftool/dbconfig/20220419-060157-marostegui.json [06:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Promote db1181 to s7 primary and set section read-write T306001', diff saved to https://phabricator.wikimedia.org/P25253 and previous config saved to /var/cache/conftool/dbconfig/20220419-060226-marostegui.json [06:02:29] All done [06:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:02:33] \o/ [06:02:44] let's test please [06:02:50] eswiki looks good [06:02:56] Amir1: anything that can be tested on centralauth? [06:03:18] account creation [06:03:42] I was testing fawiki and it works fine as well [06:03:56] restarting backfill now [06:04:48] (CR) Marostegui: [C: +2] wmnet: Update s7-master CNAME [dns] - https://gerrit.wikimedia.org/r/783820 (https://phabricator.wikimedia.org/T306001) (owner: Marostegui) [06:04:53] marostegui: want me to cleanup heartbeat? [06:04:58] kormat: I did it already [06:05:05] (CR) Abijeet Patro: [C: +1] "The https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/147379/console CI failure is due to T305931. 
Its" [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/780641 (https://phabricator.wikimedia.org/T306233) (owner: KartikMistry) [06:05:13] marostegui: fiine [06:05:59] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1136 T306001', diff saved to https://phabricator.wikimedia.org/P25254 and previous config saved to /var/cache/conftool/dbconfig/20220419-060559-marostegui.json [06:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:06:16] marostegui: https://meta.wikimedia.org/wiki/Special:GlobalRenameProgress/Wnukui [06:06:23] I did a user rename [06:06:28] Amir1: worked fine? [06:06:34] seems so far [06:06:37] excellent [06:06:54] yup [06:11:54] !log dbmaint s7@eqiad T302658 [06:11:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:11:58] T302658: globaluser table schema changes (March 2022) - https://phabricator.wikimedia.org/T302658 [06:12:06] (CR) jerkins-bot: [V: -1] ElasticSearchTTMServer: tie break on wiki+localid [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/783907 (https://phabricator.wikimedia.org/T305428) (owner: Abijeet Patro) [06:12:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:12:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1102.eqiad.wmnet with reason: Maintenance [06:12:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:12:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P25255 and previous config saved to /var/cache/conftool/dbconfig/20220419-061310-ladsgroup.json [06:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:14] 
T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:13:35] !log dbmaint s7@eqiad T300381 [06:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:13:41] T300381: Make page_props.pp_page unsigned on wmf wikis - https://phabricator.wikimedia.org/T300381 [06:14:32] RECOVERY - MariaDB Replica Lag: s4 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [06:17:39] (CR) Abijeet Patro: [C: +1] "The https://integration.wikimedia.org/ci/job/quibble-vendor-mysql-php72-noselenium-docker/147379/console CI failure is due to T305931. Its" [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/783907 (https://phabricator.wikimedia.org/T305428) (owner: Abijeet Patro) [06:18:32] !log dbmaint s7@eqiad T298557 [06:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:18:36] T298557: Fix mismatching field type of page.page_touched on wmf wikis - https://phabricator.wikimedia.org/T298557 [06:22:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [06:27:34] (PS1) Ayounsi: drmrs: add border filter to GREs [homer/public] - https://gerrit.wikimedia.org/r/784082 (https://phabricator.wikimedia.org/T303152) [06:28:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P25256 and previous config saved to /var/cache/conftool/dbconfig/20220419-062815-ladsgroup.json [06:28:18] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:28:55] (CR) Ayounsi: [C: +2] drmrs: add border filter to GREs [homer/public] - https://gerrit.wikimedia.org/r/784082 (https://phabricator.wikimedia.org/T303152) (owner: Ayounsi) [06:29:29] (Merged) jenkins-bot: drmrs: add border filter to GREs [homer/public] - https://gerrit.wikimedia.org/r/784082 (https://phabricator.wikimedia.org/T303152) (owner: Ayounsi) [06:30:15] !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox [06:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [06:33:41] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/778477 (owner: David Caro) [06:34:31] (PS2) Muehlenhoff: Enable profile::auto_restarts::service for Prometheus [puppet] - https://gerrit.wikimedia.org/r/775298 (https://phabricator.wikimedia.org/T135991) [06:35:35] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [06:35:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:37:18] !log drmrs: add tunnels to Cloudflare - T303152 [06:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:41:32] !log eqiad: add missing Cloudflare route [06:41:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:43:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P25257 and previous config saved to /var/cache/conftool/dbconfig/20220419-064320-ladsgroup.json [06:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:42] !log 
dbmaint s7@eqiad T298563 [06:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:48:47] T298563: Fix mismatching field type of column text.old_flags on wmf wikis - https://phabricator.wikimedia.org/T298563 [06:51:44] !log dbmaint s7@eqiad T305300 [06:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:49] T305300: Add lu_attachment_method column to localuser table - https://phabricator.wikimedia.org/T305300 [06:54:06] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:54:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1165.eqiad.wmnet with reason: Maintenance [06:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:09] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:13] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [06:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1165 (T306269)', diff saved to https://phabricator.wikimedia.org/P25258 and previous config saved to /var/cache/conftool/dbconfig/20220419-065417-marostegui.json [06:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:21] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [06:56:17] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance 
db1165 (T306269)', diff saved to https://phabricator.wikimedia.org/P25259 and previous config saved to /var/cache/conftool/dbconfig/20220419-065617-marostegui.json [06:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:42] !log dbmaint s7@eqiad T298554 [06:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:46] T298554: Fix mismatching field type of archive.ar_timestamp on wmf wikis - https://phabricator.wikimedia.org/T298554 [06:58:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P25260 and previous config saved to /var/cache/conftool/dbconfig/20220419-065825-ladsgroup.json [06:58:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:58:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [06:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:29] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [06:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:34] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P25261 and previous config saved to /var/cache/conftool/dbconfig/20220419-065833-ladsgroup.json [06:58:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:52] (PS1) Muehlenhoff: Remove access for jdl [puppet] - 
https://gerrit.wikimedia.org/r/784083 [06:59:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:59:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [06:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:59:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1, awight, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T0700). [07:00:04] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:42] i can deploy today [07:00:44] hello kart_ [07:01:17] urbanecm: hello! [07:01:24] (CR) Elukey: "Hello! I have some doubts related to this change, I'll try to write them down and then we can discuss what's best. My understanding is tha" [puppet] - https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) (owner: Herron) [07:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [07:01:58] urbanecm: as mentioned in both patches, CI failure is unrelated, so both patches can be merged. 
[07:02:03] yup, just saw that [07:02:04] thanks [07:02:19] (CR) Urbanecm: [V: +2 C: +2] "ci failure is T305931" [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/780641 (https://phabricator.wikimedia.org/T306233) (owner: KartikMistry) [07:02:45] (CR) Urbanecm: [V: +2 C: +2] "ci failure is T305931" [extensions/Translate] (wmf/1.39.0-wmf.7) - https://gerrit.wikimedia.org/r/783907 (https://phabricator.wikimedia.org/T305428) (owner: Abijeet Patro) [07:04:22] (CR) Muehlenhoff: [C: +2] Remove access for jdl [puppet] - https://gerrit.wikimedia.org/r/784083 (owner: Muehlenhoff) [07:04:47] kart_: both patches are at mwdebug1001, please have a look [07:05:24] Thanks urbanecm, I'm checking [07:05:29] hi abijeet :) [07:05:46] oh hello :) [07:06:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [07:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [07:06:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [07:06:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [07:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:22] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jason Linehan out of all services on: 442 hosts [07:08:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:32] urbanecm, Looks good to me! [07:08:36] syncing! 
[07:08:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jason Linehan out of all services on: 442 hosts [07:08:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:08:54] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging Jason Linehan out of all services on: 1229 hosts [07:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P25262 and previous config saved to /var/cache/conftool/dbconfig/20220419-070913-ladsgroup.json [07:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:09:40] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Jason Linehan out of all services on: 1229 hosts [07:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P25263 and previous config saved to /var/cache/conftool/dbconfig/20220419-071122-marostegui.json [07:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:11:55] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.7/extensions/Translate/ttmserver/ElasticSearchTTMServer.php: e9668719a6eb928b28ef67ba6d97348068012d04: ElasticSearchTTMServer: tie break on wiki+localid (T305428, T306233) (duration: 00m 51s) [07:11:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:00] T306233: Translation 
memory does not display anymore - https://phabricator.wikimedia.org/T306233 [07:12:00] T305428: Upgrade the Translate TTM Elasticsearch implementation to elasticsearch 6.8 and onwards - https://phabricator.wikimedia.org/T305428 [07:12:17] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [07:12:35] kart_: abijeet: first patch is live [07:12:47] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.7/extensions/Translate/src/TranslatorInterface/Aid/TTMServerAid.php: 36c6682: TTMServerAid::getData: Do not swallow TranslationHelperException (T306233) (duration: 00m 51s) [07:12:48] ...and the second one is as well :) [07:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:12:51] anything else? [07:15:36] @urbanecm, give me a moment [07:16:15] Thanks urbanecm. Will wait till abijeet is confirming.. [07:17:34] @urbanecm, changes look fine. Thanks a lot for your help [07:18:00] No problem. 
[07:18:58] Thanks urbanecm and abijeet [07:18:59] !log dbmaint s7@eqiad T301848 [07:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:19:04] T301848: Check for compressed templatelinks tables - https://phabricator.wikimedia.org/T301848 [07:19:22] !log UTC morning B&C window done [07:19:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P25264 and previous config saved to /var/cache/conftool/dbconfig/20220419-072418-ladsgroup.json [07:24:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:26:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P25265 and previous config saved to /var/cache/conftool/dbconfig/20220419-072627-marostegui.json [07:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:21] (03PS1) 10Muehlenhoff: Remove access for keepit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/784085 [07:29:51] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging CGlenn out of all services on: 442 hosts [07:29:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging CGlenn out of all services on: 442 hosts [07:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:30:41] (03PS1) 10Majavah: hieradata: update openstack to use ldap-rw hostnames [puppet] - 10https://gerrit.wikimedia.org/r/784086 (https://phabricator.wikimedia.org/T295150) [07:31:06] !log jmm@cumin2002 START - Cookbook sre.idm.logout Logging CGlenn out of all services on: 1229 hosts [07:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:56] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.idm.logout (exit_code=0) Logging CGlenn out of all services on: 1229 hosts [07:31:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:33:39] !log moving mr1-eqsin to new router [07:33:41] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (NOOP 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34875/console" [puppet] - 10https://gerrit.wikimedia.org/r/784086 (https://phabricator.wikimedia.org/T295150) (owner: 10Majavah) [07:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:50] 10Puppet, 10SRE, 10Infrastructure-Foundations: Validate all yaml files in puppet.git - https://phabricator.wikimedia.org/T305676 (10fgiunchedi) Thank you for the feedback! To clarify, I think in scope for this task there's validating yaml syntax only as a purely safety measure (as opposed to linting like in... [07:37:23] (03CR) 10Muehlenhoff: [C: 03+2] Remove access for keepit-ssh [puppet] - 10https://gerrit.wikimedia.org/r/784085 (owner: 10Muehlenhoff) [07:38:37] PROBLEM - Host asw1-eqsin is DOWN: PING CRITICAL - Packet loss = 100% [07:38:45] RECOVERY - Host asw1-eqsin is UP: PING OK - Packet loss = 0%, RTA = 314.72 ms [07:39:24] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P25266 and previous config saved to /var/cache/conftool/dbconfig/20220419-073923-ladsgroup.json [07:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:50] (03CR) 10Ayounsi: [C: 03+2] Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872) (owner: 10Ayounsi) [07:40:55] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004747 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [07:41:27] 
(03Merged) 10jenkins-bot: Setup new mr1-eqsin [homer/public] - 10https://gerrit.wikimedia.org/r/772843 (https://phabricator.wikimedia.org/T294872) (owner: 10Ayounsi) [07:41:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T306269)', diff saved to https://phabricator.wikimedia.org/P25267 and previous config saved to /var/cache/conftool/dbconfig/20220419-074132-marostegui.json [07:41:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:41:36] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1168.eqiad.wmnet with reason: Maintenance [07:41:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:38] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [07:41:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1168 (T306269)', diff saved to https://phabricator.wikimedia.org/P25268 and previous config saved to /var/cache/conftool/dbconfig/20220419-074140-marostegui.json [07:41:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306269)', diff saved to https://phabricator.wikimedia.org/P25269 and previous config saved to /var/cache/conftool/dbconfig/20220419-074636-marostegui.json [07:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:46:41] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [07:47:11] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: issue warnings for check_mw_versions [puppet] - 
10https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) (owner: 10Filippo Giunchedi) [07:47:16] (03PS2) 10Filippo Giunchedi: profile: issue warnings for check_mw_versions [puppet] - 10https://gerrit.wikimedia.org/r/767729 (https://phabricator.wikimedia.org/T302832) [07:49:12] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 9 hosts with reason: reboot [07:49:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 9 hosts with reason: reboot [07:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:42] !log restart tilerator on maps1005 (service down, following runbook) [07:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:49] RECOVERY - Check systemd state on maps1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:51:39] RECOVERY - Check systemd state on maps1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:07] RECOVERY - Check systemd state on maps1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:52:11] !log restart tilerator on maps100[678] (service down, following runbook) [07:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:53:50] !log restart tilerator on maps1010 (service down, following runbook) [07:53:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:23] RECOVERY - Check systemd state on maps1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:54:29] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 
(T298565)', diff saved to https://phabricator.wikimedia.org/P25270 and previous config saved to /var/cache/conftool/dbconfig/20220419-075428-ladsgroup.json [07:54:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [07:54:32] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [07:54:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [07:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25271 and previous config saved to /var/cache/conftool/dbconfig/20220419-075436-ladsgroup.json [07:54:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:54:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:08] (03CR) 10Ayounsi: [C: 03+2] Add script to move devices attributes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780845 (https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [08:00:25] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1162', diff saved to https://phabricator.wikimedia.org/P25272 and previous config saved to /var/cache/conftool/dbconfig/20220419-080024-marostegui.json [08:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:00:49] (03Merged) 10jenkins-bot: Add script to move devices attributes [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/780845 
(https://phabricator.wikimedia.org/T259166) (owner: 10Ayounsi) [08:01:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P25273 and previous config saved to /var/cache/conftool/dbconfig/20220419-080141-marostegui.json [08:01:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:05:41] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:06:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25275 and previous config saved to /var/cache/conftool/dbconfig/20220419-080620-ladsgroup.json [08:06:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:08:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 1%: After reboot', diff saved to https://phabricator.wikimedia.org/P25276 and previous config saved to /var/cache/conftool/dbconfig/20220419-080802-root.json [08:08:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:58] 10SRE, 10Traffic: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (10Vgutierrez) 05Open→03In progress [08:10:20] 10SRE, 10ops-eqsin, 10DC-Ops, 10Patch-For-Review: Q2(Need By: TBD) rack/setup/install new mr1-eqsin - https://phabricator.wikimedia.org/T294872 (10ayounsi) Device has been swapped, old SRX is now unracked. 
Netbox has been updated (1st use of the swap attribute script, which worked perfectly) Old: https://... [08:12:26] jouncebot: now [08:12:26] No deployments scheduled for the next 4 hour(s) and 47 minute(s) [08:12:41] PROBLEM - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:14:55] PROBLEM - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:29] !log Restarting CI Jenkins on contint2001 for plugins updates [08:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:16:45] PROBLEM - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:16:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P25277 and previous config saved to /var/cache/conftool/dbconfig/20220419-081646-marostegui.json [08:16:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:19] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10ayounsi) [08:18:42] 10SRE-tools, 10Infrastructure-Foundations, 10netbox, 10Patch-For-Review: Move device attributes - https://phabricator.wikimedia.org/T259166 (10ayounsi) 05Open→03Resolved The script has been used successfully for mr1-eqsin and mr1-ulsfo. 
[08:19:29] PROBLEM - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:20:10] !log systemctl restart kartotherian on maps1010 [08:20:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P25278 and previous config saved to /var/cache/conftool/dbconfig/20220419-082125-ladsgroup.json [08:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:37] 10SRE, 10ops-ulsfo, 10DC-Ops, 10Infrastructure-Foundations, 10netops: (Need By: TBD) rack/setup/install new mr1-ulsfo - https://phabricator.wikimedia.org/T294314 (10ayounsi) a:05ayounsi→03RobH Netbox has been updated to the best of my knowledge using the new https://netbox.wikimedia.org/extras/script... [08:23:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 5%: After reboot', diff saved to https://phabricator.wikimedia.org/P25279 and previous config saved to /var/cache/conftool/dbconfig/20220419-082306-root.json [08:23:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:23:55] (03PS3) 10Kormat: mariadb: Stop special-casing db2093 [puppet] - 10https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) [08:24:14] (03PS3) 10Kormat: mariadb: Use ROW binlog_format for db_inventory. 
[puppet] - 10https://gerrit.wikimedia.org/r/775330 (https://phabricator.wikimedia.org/T301315) [08:26:33] PROBLEM - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:21] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:29:22] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.network.cf (exit_code=0) [08:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:35] !log deploying monitoring change for db2093 T301315 https://gerrit.wikimedia.org/r/c/operations/puppet/+/775852 [08:29:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:39] T301315: Move orchestrator from db2093 to db1115 - https://phabricator.wikimedia.org/T301315 [08:29:47] (03CR) 10Kormat: [C: 03+2] mariadb: Stop special-casing db2093 [puppet] - 10https://gerrit.wikimedia.org/r/775852 (https://phabricator.wikimedia.org/T301315) (owner: 10Kormat) [08:29:55] !log turn CF on for drmrs (test) [08:29:57] !log ayounsi@cumin2002 START - Cookbook sre.network.cf [08:29:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:58] !log ayounsi@cumin2002 END (FAIL) - Cookbook sre.network.cf (exit_code=1) [08:30:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T306269)', diff saved to https://phabricator.wikimedia.org/P25280 and previous config saved to /var/cache/conftool/dbconfig/20220419-083151-marostegui.json [08:31:53] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:31:55] !log marostegui@cumin1001 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [08:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:31:56] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [08:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:00] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25281 and previous config saved to /var/cache/conftool/dbconfig/20220419-083159-marostegui.json [08:32:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:32:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:23] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25282 and previous config saved to /var/cache/conftool/dbconfig/20220419-083623-marostegui.json [08:36:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:30] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P25283 and previous config saved to /var/cache/conftool/dbconfig/20220419-083630-ladsgroup.json [08:36:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:38:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 10%: After reboot', diff saved to https://phabricator.wikimedia.org/P25284 and previous config saved to /var/cache/conftool/dbconfig/20220419-083810-root.json [08:38:12] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:42:06] (03CR) 10Elukey: [C: 03+2] "Checked binaries and syntax, look good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/779438 (https://phabricator.wikimedia.org/T301415) (owner: 10Kevin Bazira) [08:49:00] (03CR) 10Elukey: [C: 03+1] sre.kafka.reboot-workers: remove systemctl stop calls (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [08:50:29] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for drochford (superset access with no server access) - https://phabricator.wikimedia.org/T305634 (10Volans) [08:51:11] (03PS1) 10KartikMistry: Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784223 (https://phabricator.wikimedia.org/T304854) [08:51:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P25285 and previous config saved to /var/cache/conftool/dbconfig/20220419-085128-marostegui.json [08:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25286 and previous config saved to /var/cache/conftool/dbconfig/20220419-085135-ladsgroup.json [08:51:37] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [08:51:40] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, 
user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [08:51:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:51:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:51:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25287 and previous config saved to /var/cache/conftool/dbconfig/20220419-085148-ladsgroup.json [08:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:53:14] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 25%: After reboot', diff saved to https://phabricator.wikimedia.org/P25288 and previous config saved to /var/cache/conftool/dbconfig/20220419-085313-root.json [08:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [08:56:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team, 10Patch-For-Review: Q2:(Need By: TBD) rack/setup/install ml-serve100[5-8] - https://phabricator.wikimedia.org/T294949 (10cmooney) These hosts hit the ARP issue described in T306421, and have been offline following re-image until this morning: https:... 
[08:56:33] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (10ayounsi) Ping? [08:56:49] (03CR) 10MVernon: [C: 03+2] swift: correct handling of non-ASCII paths in rewrite.py & test suite [puppet] - 10https://gerrit.wikimedia.org/r/779900 (https://phabricator.wikimedia.org/T305942) (owner: 10MVernon) [08:57:00] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[2070-2072].codfw.wmnet with reason: reboot [08:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[2070-2072].codfw.wmnet with reason: reboot [08:57:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:57:14] (03CR) 10Filippo Giunchedi: "LGTM in theory, but not quite familiar with keyholder" [puppet] - 10https://gerrit.wikimedia.org/r/779897 (owner: 10Ottomata) [08:59:49] (03PS1) 10Kormat: base: Don't set TMOUT as read-only. [puppet] - 10https://gerrit.wikimedia.org/r/784224 [09:01:30] (03CR) 10Filippo Giunchedi: prometheus: enable prometheus web access via proxy with IDP (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [09:01:42] (03PS1) 10Kormat: admin: (kormat) Stop losing state over breaks. 
[puppet] - 10https://gerrit.wikimedia.org/r/784225 [09:02:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25290 and previous config saved to /var/cache/conftool/dbconfig/20220419-090256-ladsgroup.json [09:03:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:03:17] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: 10Herron) [09:05:55] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic[1084-1088].eqiad.wmnet with reason: reboot [09:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:03] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic[1084-1088].eqiad.wmnet with reason: reboot [09:06:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316', diff saved to https://phabricator.wikimedia.org/P25291 and previous config saved to /var/cache/conftool/dbconfig/20220419-090633-marostegui.json [09:06:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:08:18] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 50%: After reboot', diff saved to https://phabricator.wikimedia.org/P25292 and previous config saved to /var/cache/conftool/dbconfig/20220419-090817-root.json [09:08:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:14] 
(03CR) 10Alexandros Kosiaris: [C: 03+1] base: Don't set TMOUT as read-only. [puppet] - 10https://gerrit.wikimedia.org/r/784224 (owner: 10Kormat) [09:11:52] (03CR) 10Kormat: [C: 03+2] admin: (kormat) Stop losing state over breaks. [puppet] - 10https://gerrit.wikimedia.org/r/784225 (owner: 10Kormat) [09:12:08] (03CR) 10Kormat: [C: 03+2] base: Don't set TMOUT as read-only. [puppet] - 10https://gerrit.wikimedia.org/r/784224 (owner: 10Kormat) [09:15:31] (03CR) 10Muehlenhoff: prometheus: enable prometheus web access via proxy with IDP (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/764895 (https://phabricator.wikimedia.org/T301944) (owner: 10Herron) [09:16:00] RECOVERY - MariaDB Replica Lag: s1 on clouddb1021 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [09:16:10] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 1236 threshold =0.2 breach: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1347, active_shards: 2886, relocating_shards: 0, initializing_shards: 72, unassigned_shards: 1164, delayed_unassigned_shards: 0, number_of [09:16:10] _tasks: 1628, number_of_in_flight_fetch: 36, task_max_waiting_in_queue_millis: 247056, active_shards_percent_as_number: 70.01455604075691 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:16:36] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:16:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:20] PROBLEM - SSH on aqs1008.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [09:18:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P25293 and previous config saved to /var/cache/conftool/dbconfig/20220419-091802-ladsgroup.json [09:18:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:10] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.eqiad.wmnet is OK: OK - elasticsearch status production-search-eqiad: cluster_name: production-search-eqiad, status: red, timed_out: False, number_of_nodes: 36, number_of_data_nodes: 36, active_primary_shards: 1355, active_shards: 3657, relocating_shards: 0, initializing_shards: 55, unassigned_shards: 410, delayed_unassigned_shards: 0, number_of_pending_tasks: 714, num [09:18:10] n_flight_fetch: 0, task_max_waiting_in_queue_millis: 366354, active_shards_percent_as_number: 88.71906841339155 https://wikitech.wikimedia.org/wiki/Search%23Administration [09:20:13] (03CR) 10Filippo Giunchedi: logstash: populate target index format and add pipeline diagnostics (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [09:20:54] (03PS1) 10Alexandros Kosiaris: helmfile.d: Remove all reference to tillerNamespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/784227 [09:21:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25294 and previous config saved to /var/cache/conftool/dbconfig/20220419-092138-marostegui.json [09:21:40] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: 
Maintenance [09:21:41] !log kevinbazira@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-goodfaith' for release 'main' . [09:21:42] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [09:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:43] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [09:21:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25295 and previous config saved to /var/cache/conftool/dbconfig/20220419-092146-marostegui.json [09:21:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:22] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 75%: After reboot', diff saved to https://phabricator.wikimedia.org/P25296 and previous config saved to /var/cache/conftool/dbconfig/20220419-092321-root.json [09:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:33] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [09:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:03] (03CR) 10Alexandros Kosiaris: [C: 04-1] "1 minor comment, otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: 10Majavah) [09:24:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:25:31] (PS2) Filippo Giunchedi: sre: add alerts for exporter-specific unavailability [alerts] - https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) [09:25:49] (CR) Filippo Giunchedi: sre: add alerts for exporter-specific unavailability (1 comment) [alerts] - https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: Filippo Giunchedi) [09:26:23] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:26:27] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes2007.codfw.wmnet, kubernetes2009.codfw.wmnet, kubernetes2010.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2021.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBa [09:26:37] PROBLEM - k8s API server requests latencies on ml-serve-ctrl1001 is CRITICAL: instance=10.64.16.202 verb={CREATE,PATCH} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:26:45] PROBLEM - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on tegola-vector-tiles.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.60 and port 4105: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:26:51] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:26:51] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- 
kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:26:53] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:27:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25297 and previous config saved to /var/cache/conftool/dbconfig/20220419-092710-marostegui.json [09:27:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:15] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [09:27:45] RECOVERY - k8s API server requests latencies on ml-serve-ctrl1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=27 [09:29:23] (PS1) Muehlenhoff: Extend access for effeietsanders [puppet] - https://gerrit.wikimedia.org/r/784228 [09:30:02] ACKNOWLEDGEMENT - Check systemd state on maps1005 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Disabled service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:02] ACKNOWLEDGEMENT - Check systemd state on maps1006 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Disabled service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:02] ACKNOWLEDGEMENT - Check systemd state on maps1007 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Disabled service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:02] ACKNOWLEDGEMENT - Check systemd state on maps1008 is CRITICAL: CRITICAL - degraded: The following units failed: 
tilerator.service Hnowlan Disabled service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:02] ACKNOWLEDGEMENT - Check systemd state on maps1010 is CRITICAL: CRITICAL - degraded: The following units failed: tilerator.service Hnowlan Disabled service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:30:09] (PS5) Majavah: Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) [09:30:11] (PS5) Majavah: helmfile.d: add developer-portal [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) [09:30:57] (CR) Majavah: helmfile.d: add developer-portal (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [09:31:34] (CR) Muehlenhoff: [C: +2] Extend access for effeietsanders [puppet] - https://gerrit.wikimedia.org/r/784228 (owner: Muehlenhoff) [09:33:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P25298 and previous config saved to /var/cache/conftool/dbconfig/20220419-093307-ladsgroup.json [09:33:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:33:50] (PS1) Muehlenhoff: Fix date [puppet] - https://gerrit.wikimedia.org/r/784229 [09:33:53] (PS1) Jgiannelos: Temporarily disable tile pregeneration on eqiad [deployment-charts] - https://gerrit.wikimedia.org/r/784230 (https://phabricator.wikimedia.org/T306424) [09:34:54] (CR) Muehlenhoff: [C: +2] Fix date [puppet] - https://gerrit.wikimedia.org/r/784229 (owner: Muehlenhoff) [09:35:01] SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (fgiunchedi) It looks like there's a mismatch in the urls mediawiki is sending out (project `test-commons`) vs the 
containers for "test commons" that have been created (i.e. `testcommons`, no dash) [09:36:14] (PS2) Jgiannelos: tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784230 (https://phabricator.wikimedia.org/T306424) [09:37:48] (CR) Jgiannelos: [C: -1] tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784230 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:38:12] (CR) MSantos: [C: -1] "we should disable the cron, not tile caching" [deployment-charts] - https://gerrit.wikimedia.org/r/784230 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:38:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1162 (re)pooling @ 100%: After reboot', diff saved to https://phabricator.wikimedia.org/P25299 and previous config saved to /var/cache/conftool/dbconfig/20220419-093825-root.json [09:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:38:45] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:40:16] (PS1) Jgiannelos: tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784231 (https://phabricator.wikimedia.org/T306424) [09:40:37] (Abandoned) Jgiannelos: tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784230 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:40:59] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes2007.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2015.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2021.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2008.codfw.wmnet are marked down but pooled 
https://wikitech.wikimedia.org/wiki/PyBa [09:41:11] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:41:27] (CR) Muehlenhoff: [C: +2] Enable profile::auto_restarts::service for Prometheus [puppet] - https://gerrit.wikimedia.org/r/775298 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff) [09:41:54] (CR) MSantos: [C: +1] tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784231 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:42:14] (PS1) MMandere: install_server: Reimage pybal-test2002 as buster [puppet] - https://gerrit.wikimedia.org/r/784232 (https://phabricator.wikimedia.org/T297187) [09:42:16] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P25300 and previous config saved to /var/cache/conftool/dbconfig/20220419-094215-marostegui.json [09:42:16] (PS1) MMandere: install_server: Reimage pybal-test2003 as buster [puppet] - https://gerrit.wikimedia.org/r/784233 (https://phabricator.wikimedia.org/T297187) [09:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:52] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [09:43:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:32] (CR) Alexandros Kosiaris: [C: -1] Add developer-portal chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [09:44:41] SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (fgiunchedi) I don't recall running into this before ATM but I'm imagining we'd need to add a rule to normalize container names when the project name contains dashes (or add 
test-commons to such a list of proj... [09:45:12] (PS3) Alexandros Kosiaris: add developer.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/783849 (https://phabricator.wikimedia.org/T287748) (owner: Majavah) [09:45:20] (CR) Jgiannelos: [C: +2] tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784231 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:46:29] (CR) Vgutierrez: [C: +1] install_server: Reimage pybal-test2002 as buster [puppet] - https://gerrit.wikimedia.org/r/784232 (https://phabricator.wikimedia.org/T297187) (owner: MMandere) [09:46:35] (CR) Vgutierrez: [C: +1] install_server: Reimage pybal-test2003 as buster [puppet] - https://gerrit.wikimedia.org/r/784233 (https://phabricator.wikimedia.org/T297187) (owner: MMandere) [09:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [09:48:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25301 and previous config saved to /var/cache/conftool/dbconfig/20220419-094812-ladsgroup.json [09:48:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:17] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [09:48:17] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:48:19] !log ladsgroup@cumin1001 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [09:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [09:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [09:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:40] (CR) Volans: [C: +1] "LGTM, PCC seems happy too https://puppet-compiler.wmflabs.org/pcc-worker1002/34876/" [puppet] - https://gerrit.wikimedia.org/r/779897 (owner: Ottomata) [09:49:51] (Merged) jenkins-bot: tegola: Temporarily disable tile pregeneration [deployment-charts] - https://gerrit.wikimedia.org/r/784231 (https://phabricator.wikimedia.org/T306424) (owner: Jgiannelos) [09:50:59] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:36] (PS1) Muehlenhoff: Remove LDAP access for wikitrent [puppet] - https://gerrit.wikimedia.org/r/784234 [09:53:15] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:54:28] (CR) Muehlenhoff: [C: +2] Remove LDAP access for wikitrent [puppet] - https://gerrit.wikimedia.org/r/784234 (owner: Muehlenhoff) [09:55:41] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:55:41] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:57:21] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316', diff saved to https://phabricator.wikimedia.org/P25302 and previous config saved to /var/cache/conftool/dbconfig/20220419-095720-marostegui.json [09:57:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:36] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [09:57:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [09:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:42] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25303 and previous config saved to /var/cache/conftool/dbconfig/20220419-095742-ladsgroup.json [09:57:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:46] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:04:58] PROBLEM - Outgoing network saturation on labstore1006 is CRITICAL: 1.083e+09 ge 1.062e+09 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [10:05:27] (CR) Alexandros Kosiaris: [C: +2] add developer.wikimedia.org alias [dns] - https://gerrit.wikimedia.org/r/783849 (https://phabricator.wikimedia.org/T287748) (owner: Majavah) [10:11:41] SRE-swift-storage, Data-Persistence, Maps, 
Product-Infrastructure-Team-Backlog, Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (Jgiannelos) [10:12:26] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25304 and previous config saved to /var/cache/conftool/dbconfig/20220419-101225-marostegui.json [10:12:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:12:29] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1131.eqiad.wmnet with reason: Maintenance [10:12:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:32] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [10:12:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1131 (T306269)', diff saved to https://phabricator.wikimedia.org/P25305 and previous config saved to /var/cache/conftool/dbconfig/20220419-101233-marostegui.json [10:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:14:33] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306269)', diff saved to https://phabricator.wikimedia.org/P25306 and previous config saved to /var/cache/conftool/dbconfig/20220419-101433-marostegui.json [10:14:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:21] RECOVERY - Outgoing network saturation on labstore1006 is OK: (C)1.062e+09 ge (W)9.375e+08 ge 8.66e+08 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore 
https://grafana.wikimedia.org/d/000000568/labstore1004-1005-1006-1007 [10:17:51] !log installing gzip/zgrep security updates [10:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:29] RECOVERY - SSH on aqs1008.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [10:22:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [10:26:45] (PS1) Btullis: Update the datahub image used for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/784235 (https://phabricator.wikimedia.org/T306019) [10:29:38] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P25308 and previous config saved to /var/cache/conftool/dbconfig/20220419-102938-marostegui.json [10:29:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:35:05] (PS1) Jgiannelos: tegola: Temporarily disable swift caching [deployment-charts] - https://gerrit.wikimedia.org/r/784236 [10:36:36] SRE-swift-storage, Data-Persistence, Maps, Product-Infrastructure-Team-Backlog: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (MSantos) A side-note that we should create a follow-up action is: these logs are not registered as errors, b... 
[10:38:28] (PS2) Jgiannelos: tegola: Temporarily disable swift caching [deployment-charts] - https://gerrit.wikimedia.org/r/784236 [10:39:14] !log reimage pybal-test2002 as buster - T297187 [10:39:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:19] T297187: Upgrade pybal-test200[23] from Stretch to Buster - https://phabricator.wikimedia.org/T297187 [10:39:23] (CR) Jgiannelos: "This patch disables swift caching on the tegola level. We can try disabling OSM sync and allow live traffic to Postgres." [deployment-charts] - https://gerrit.wikimedia.org/r/784236 (owner: Jgiannelos) [10:42:41] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [10:44:43] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131', diff saved to https://phabricator.wikimedia.org/P25309 and previous config saved to /var/cache/conftool/dbconfig/20220419-104443-marostegui.json [10:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:23] (CR) MSantos: [C: +1] tegola: Temporarily disable swift caching [deployment-charts] - https://gerrit.wikimedia.org/r/784236 (owner: Jgiannelos) [10:48:47] RECOVERY - Host pybal-test2002 is UP: PING OK - Packet loss = 0%, RTA = 32.06 ms [10:53:45] (CR) MMandere: [C: +2] install_server: Reimage pybal-test2002 as buster [puppet] - https://gerrit.wikimedia.org/r/784232 (https://phabricator.wikimedia.org/T297187) (owner: MMandere) [10:56:35] (CR) Jgiannelos: [C: +2] tegola: Temporarily disable swift caching [deployment-charts] - https://gerrit.wikimedia.org/r/784236 (owner: Jgiannelos) [10:57:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25310 and previous config saved to /var/cache/conftool/dbconfig/20220419-105756-ladsgroup.json [10:58:00] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:01] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [10:59:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1131 (T306269)', diff saved to https://phabricator.wikimedia.org/P25311 and previous config saved to /var/cache/conftool/dbconfig/20220419-105948-marostegui.json [10:59:50] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:59:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1005.eqiad.wmnet with reason: Maintenance [10:59:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:53] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [10:59:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:44] SRE-swift-storage, Data-Persistence, Maps, Product-Infrastructure-Team-Backlog: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (fgiunchedi) Current/immediate plan of action is: * disable pregen/caching of tiles on swift * temporarily r... 
[11:00:51] (Merged) jenkins-bot: tegola: Temporarily disable swift caching [deployment-charts] - https://gerrit.wikimedia.org/r/784236 (owner: Jgiannelos) [11:00:56] (CR) MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: MarcoAurelio) [11:01:04] (PS4) MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) [11:01:53] (PS5) MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki [mediawiki-config] - https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) [11:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [11:02:14] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:02:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2129.codfw.wmnet with reason: Maintenance [11:02:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:17] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [11:02:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [11:02:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:41] !log jgiannelos@deploy1002 helmfile [eqiad] START 
helmfile.d/services/tegola-vector-tiles: apply [11:02:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:02:52] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:02:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:59] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:04:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:03] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [11:04:04] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:42] !log installing xz-utils/xzgrep security updates [11:04:43] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:45] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1140.eqiad.wmnet with reason: Maintenance [11:04:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:58] (03CR) 10MarcoAurelio: Enable $wgFixDoubleRedirects on officewiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [11:05:25] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:05:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:30] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:05:32] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:41] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:05:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:49] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:04] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:07:06] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1098.eqiad.wmnet with reason: Maintenance [11:07:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:11] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1098:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25312 and previous config saved to /var/cache/conftool/dbconfig/20220419-110710-marostegui.json [11:07:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:15] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [11:07:15] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:07:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:19] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:08:11] nemo-yiannis: how is tegola looking now ? 
[11:08:35] i am currently facing issues with helm [11:08:43] RECOVERY - Host pybal-test2002 is UP: PING OK - Packet loss = 0%, RTA = 32.15 ms [11:08:46] !log hnowlan@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'similar-users' for release 'main' . [11:08:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:32] ack [11:09:41] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [11:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:03] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.253 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:10:08] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [11:10:09] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.258 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25313 and previous config saved to /var/cache/conftool/dbconfig/20220419-111046-marostegui.json [11:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:51] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.210 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:10:53] RECOVERY - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on tegola-vector-tiles.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2363 bytes in 1.141 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:10:59] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.207 
second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:10:59] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.208 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:11:07] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.218 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:11:21] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.134 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:11:21] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.217 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:11:27] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.213 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [11:12:39] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:13:01] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:13:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P25314 and previous config saved to /var/cache/conftool/dbconfig/20220419-111301-ladsgroup.json [11:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:39] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:18:49] !log jgiannelos@deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:24] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:21:26] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:21:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:21:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:23:57] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:23:59] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:22] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [11:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:25] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [11:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:25:51] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P25315 and previous config saved to /var/cache/conftool/dbconfig/20220419-112551-marostegui.json [11:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P25316 and previous config saved to /var/cache/conftool/dbconfig/20220419-112806-ladsgroup.json [11:28:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:46] !log hnowlan@deploy1002 helmfile [eqiad] 
START helmfile.d/services/tegola-vector-tiles: sync [11:28:48] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:28:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:32] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:32:31] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [11:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:09] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 3.345 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:34:45] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:36:17] RECOVERY - PyBal backends health check on lvs1019 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:40:01] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.018 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:40:56] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316', diff saved to https://phabricator.wikimedia.org/P25317 and previous config saved to /var/cache/conftool/dbconfig/20220419-114056-marostegui.json [11:40:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:42:25] RECOVERY - LVS 
tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 6.724 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:43:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25318 and previous config saved to /var/cache/conftool/dbconfig/20220419-114311-ladsgroup.json [11:43:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:16] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:43:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:43:24] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2110.codfw.wmnet with reason: Maintenance [11:43:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 12 hosts with reason: Maintenance [11:43:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 12 hosts with reason: Maintenance [11:43:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - 
https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [11:47:01] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1136.eqiad.wmnet with OS bullseye [11:47:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:49:23] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [11:49:32] (PS2) MMandere: install_server: Reimage pybal-test2003 as buster [puppet] - https://gerrit.wikimedia.org/r/784233 (https://phabricator.wikimedia.org/T297187) [11:50:18] (CR) MMandere: [C: +2] install_server: Reimage pybal-test2003 as buster [puppet] - https://gerrit.wikimedia.org/r/784233 (https://phabricator.wikimedia.org/T297187) (owner: MMandere) [11:52:28] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:52:29] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1121.eqiad.wmnet with reason: Maintenance [11:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:31] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [11:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[11:52:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25319 and previous config saved to /var/cache/conftool/dbconfig/20220419-115239-ladsgroup.json [11:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:52:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [11:54:22] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes1008.eqiad.wmnet, kubernetes1012.eqiad.wmnet, kubernetes1020.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1022.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [11:56:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1098:3316 (T306269)', diff saved to https://phabricator.wikimedia.org/P25320 and previous config saved to /var/cache/conftool/dbconfig/20220419-115601-marostegui.json [11:56:03] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [11:56:05] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1180.eqiad.wmnet with reason: Maintenance [11:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:06] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [11:56:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:10] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depooling db1180 (T306269)', diff saved to 
https://phabricator.wikimedia.org/P25321 and previous config saved to /var/cache/conftool/dbconfig/20220419-115609-marostegui.json [11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:56:27] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [11:56:54] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [11:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:45] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1136.eqiad.wmnet with reason: host reimage [11:57:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:16] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:01:08] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1136.eqiad.wmnet with reason: host reimage [12:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:01:54] (CR) Btullis: [C: +2] Update the datahub image used for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/784235 (https://phabricator.wikimedia.org/T306019) (owner: Btullis) [12:02:02] !log create tegola-swift-fallback container in account tegola [12:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:21] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes1022.eqiad.wmnet, kubernetes1008.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1007.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1005.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:03:23] RECOVERY - PyBal backends 
health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:03:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25322 and previous config saved to /var/cache/conftool/dbconfig/20220419-120327-ladsgroup.json [12:03:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:03:33] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:03:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:04:49] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:05:23] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:05:56] (Merged) jenkins-bot: Update the datahub image used for deployment [deployment-charts] - https://gerrit.wikimedia.org/r/784235 (https://phabricator.wikimedia.org/T306019) (owner: Btullis) [12:06:01] RECOVERY - Host pybal-test2002 is UP: PING OK - Packet loss = 0%, RTA = 32.09 ms [12:06:33] PROBLEM - SSH on pybal-test2002 is CRITICAL: connect to address 10.192.16.140 and port 22: Connection refused https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:06:40] (PS1) Jgiannelos: tegola: Use fallback swift container as interim cache [deployment-charts] - https://gerrit.wikimedia.org/r/784245 [12:06:57] (PS1) 
Hnowlan: tegola: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/784246 (https://phabricator.wikimedia.org/T306424) [12:07:45] RECOVERY - SSH on pybal-test2002 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [12:08:04] (CR) Jgiannelos: "We created a new interim swift container just to add some caching in the current state." [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:08:49] (CR) MSantos: [C: +1] tegola: Use fallback swift container as interim cache [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:09:09] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes1008.eqiad.wmnet, kubernetes1022.eqiad.wmnet, kubernetes1019.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1014.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1010.eqiad.wmnet, kubernetes1016.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1017.eqiad.wmnet, kubernetes1006.eqiad.wmnet, kubernetes [12:09:09] ad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:09:43] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:09:53] (CR) Jgiannelos: "From the CLI it looks like the container works for our credentials" [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:09:55] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:11:01] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 3.007 second response time 
https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:12:03] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:12:38] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main [12:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:29] (PS2) Jgiannelos: tegola: Use fallback swift container as interim cache [deployment-charts] - https://gerrit.wikimedia.org/r/784245 [12:14:03] PROBLEM - Host pybal-test2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:14:27] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main [12:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:15:08] (CR) Jgiannelos: "I added a shared base path so we use the same resources between eqiad/codfw" [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:15:47] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.60 and port 4105: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:16:33] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=1) for host db1136.eqiad.wmnet with OS bullseye [12:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:45] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:16:57] RECOVERY - Host pybal-test2002 is UP: PING OK - Packet loss = 0%, RTA = 31.80 ms [12:17:04] (CR) MSantos: [C: +1] tegola: Use fallback swift container as interim cache [deployment-charts] 
- https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:17:19] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:17:41] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 9.547 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:18:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P25323 and previous config saved to /var/cache/conftool/dbconfig/20220419-121832-ladsgroup.json [12:18:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:50] (CR) Jgiannelos: [C: +2] tegola: Use fallback swift container as interim cache [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:18:57] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:20:00] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/datahub: apply on main [12:20:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:46] (PS6) Majavah: Add developer-portal chart [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) [12:20:48] (PS6) Majavah: helmfile.d: add developer-portal [deployment-charts] - https://gerrit.wikimedia.org/r/773995 (https://phabricator.wikimedia.org/T297140) [12:21:30] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/datahub: sync on main [12:21:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[12:22:18] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/datahub: apply on main [12:22:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:22:36] (Merged) jenkins-bot: tegola: Use fallback swift container as interim cache [deployment-charts] - https://gerrit.wikimedia.org/r/784245 (owner: Jgiannelos) [12:23:10] (CR) Majavah: Add developer-portal chart (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/773994 (https://phabricator.wikimedia.org/T297140) (owner: Majavah) [12:23:19] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:23:19] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/datahub: sync on main [12:23:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:37] PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes1009.eqiad.wmnet, kubernetes1018.eqiad.wmnet, kubernetes1005.eqiad.wmnet, kubernetes1013.eqiad.wmnet, kubernetes1021.eqiad.wmnet, kubernetes1015.eqiad.wmnet, kubernetes1016.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:25:07] (PS1) Jgiannelos: tegola: Enable interim caching [deployment-charts] - https://gerrit.wikimedia.org/r/784249 [12:25:14] (PS2) Jgiannelos: tegola: Enable interim caching [deployment-charts] - https://gerrit.wikimedia.org/r/784249 [12:25:29] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 7.438 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:25:50] (CR) Jgiannelos: "Forgot to flip the switch to enable caching in the previous patch" [deployment-charts] - https://gerrit.wikimedia.org/r/784249 (owner: Jgiannelos) [12:26:39] (CR) MSantos: [C: +1] tegola: Enable interim caching [deployment-charts] - 
https://gerrit.wikimedia.org/r/784249 (owner: Jgiannelos) [12:26:55] RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:28:55] (LogstashIngestSpike) firing: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:29:15] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:30:35] (PS1) Filippo Giunchedi: swift: add disable_fallocate config option [puppet] - https://gerrit.wikimedia.org/r/784250 (https://phabricator.wikimedia.org/T306424) [12:31:18] !log mmandere@cumin1001 START - Cookbook sre.puppet.renew-cert for pybal-test2002.codfw.wmnet: Renew puppet certificate - mmandere@cumin1001 [12:31:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:21] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - tegola-vector-tiles_4105: Servers kubernetes2007.codfw.wmnet, kubernetes2018.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2017.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2022.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [12:31:21] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for pybal-test2002.codfw.wmnet: Renew puppet certificate - mmandere@cumin1001 [12:31:23] (CR) Jgiannelos: [C: +2] tegola: Enable interim caching [deployment-charts] - https://gerrit.wikimedia.org/r/784249 (owner: Jgiannelos) [12:31:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:27] PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:31:46] (PS2) Filippo Giunchedi: swift: add disable_fallocate config option [puppet] - https://gerrit.wikimedia.org/r/784250 (https://phabricator.wikimedia.org/T306424) [12:32:51] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [12:33:14] (CR) Filippo Giunchedi: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/784250 (https://phabricator.wikimedia.org/T306424) (owner: Filippo Giunchedi) [12:33:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121', diff saved to https://phabricator.wikimedia.org/P25324 and previous config saved to /var/cache/conftool/dbconfig/20220419-123337-ladsgroup.json [12:33:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:55] (LogstashIngestSpike) resolved: (2) Logstash rate of ingestion percent change compared to yesterday - https://phabricator.wikimedia.org/T202307 - https://grafana.wikimedia.org/d/000000561/logstash?orgId=1&panelId=2&fullscreen - https://alerts.wikimedia.org/?q=alertname%3DLogstashIngestSpike [12:35:20] (PS1) Hnowlan: tegola: increase memory usage [deployment-charts] - https://gerrit.wikimedia.org/r/784254 (https://phabricator.wikimedia.org/T306424) [12:36:17] (Merged) jenkins-bot: tegola: Enable interim caching [deployment-charts] - https://gerrit.wikimedia.org/r/784249 (owner: Jgiannelos) [12:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [12:37:39] (CR) Jgiannelos: [C: +1] tegola: increase memory usage [deployment-charts] - https://gerrit.wikimedia.org/r/784254 (https://phabricator.wikimedia.org/T306424) (owner: Hnowlan) [12:38:37] !log jgiannelos@deploy1002 helmfile [eqiad] START 
helmfile.d/services/tegola-vector-tiles: apply [12:38:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:40:48] (CR) Hnowlan: [C: +2] tegola: increase memory usage [deployment-charts] - https://gerrit.wikimedia.org/r/784254 (https://phabricator.wikimedia.org/T306424) (owner: Hnowlan) [12:40:57] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [12:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:07] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [12:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:41:28] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [12:41:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:37] (CR) Phedenskog: "I've updated the code to use the new proxy, so hopefully that will work." 
[puppet] - https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: Phedenskog) [12:43:49] PROBLEM - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on tegola-vector-tiles.svc.codfw.wmnet is CRITICAL: connect to address 10.2.1.60 and port 4105: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:44:46] (Merged) jenkins-bot: tegola: increase memory usage [deployment-charts] - https://gerrit.wikimedia.org/r/784254 (https://phabricator.wikimedia.org/T306424) (owner: Hnowlan) [12:45:40] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [12:45:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:09] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: connect to address 10.2.2.60 and port 4105: Connection refused https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:46:10] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [12:46:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:43] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [12:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:55] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [12:46:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:16] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [12:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:25] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.550 second response time 
https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:47:26] RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 1.902 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:47:33] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:47:53] RECOVERY - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on tegola-vector-tiles.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2364 bytes in 1.157 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:48:07] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.164 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:48:17] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.142 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:48:17] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 2.271 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:48:43] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.200 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:48:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1121 (T298565)', diff saved to https://phabricator.wikimedia.org/P25325 and previous config saved to /var/cache/conftool/dbconfig/20220419-124843-ladsgroup.json [12:48:45] 
!log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:48:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1141.eqiad.wmnet with reason: Maintenance [12:48:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:48] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [12:48:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25326 and previous config saved to /var/cache/conftool/dbconfig/20220419-124851-ladsgroup.json [12:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:53] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 2.288 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:48:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:07] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.086 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:49:17] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.606 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:52:01] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.284 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:52:53] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet 
is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.099 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:53:07] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 8.920 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:54:03] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 8.700 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:55:45] (03PS1) 104nn1l2: mrwikisource: Add template editor and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784255 (https://phabricator.wikimedia.org/T269067) [12:56:21] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:56:24] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306269)', diff saved to https://phabricator.wikimedia.org/P25327 and previous config saved to /var/cache/conftool/dbconfig/20220419-125623-marostegui.json [12:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:56:31] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [12:57:09] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.059 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:58:33] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 7.155 
second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [12:58:37] (03PS1) 10Vgutierrez: cache::haproxy: Log emergency messages to disk [puppet] - 10https://gerrit.wikimedia.org/r/784256 (https://phabricator.wikimedia.org/T306236) [12:59:40] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25328 and previous config saved to /var/cache/conftool/dbconfig/20220419-125939-ladsgroup.json [12:59:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:44] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:00:05] RoanKattouw, Lucas_WMDE, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T1300). [13:00:05] nn1l2: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:17] PROBLEM - SSH on wtp1036.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:00:28] I’m in a meeting and probably can’t deploy, sorry [13:00:34] hi [13:00:41] Lucas_WMDE: add it to the meeting agenda [13:03:01] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 2.228 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:03:30] !log volans@cumin1001 START - Cookbook sre.network.cf [13:03:30] !log volans@cumin1001 END (PASS) - Cookbook sre.network.cf (exit_code=0) [13:03:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:01] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:04:02] (03PS1) 10Hnowlan: tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) [13:04:10] (03CR) 10jerkins-bot: [V: 04-1] tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [13:04:16] Urbanecm: hi, can you deploy in this window? 
[13:05:17] (03PS2) 10Hnowlan: tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) [13:05:29] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 4.435 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:06:41] RoanKattouw: Hi, can you deploy in this window? [13:07:04] (03CR) 10MVernon: [C: 03+1] "LGTM, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/784250 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:07:34] (03CR) 10Filippo Giunchedi: [C: 03+2] swift: add disable_fallocate config option [puppet] - 10https://gerrit.wikimedia.org/r/784250 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:07:47] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:09:19] I'll be around in this window. Anybody who can deploy, please ping me. 
Thanks [13:09:30] (03PS4) 10Elukey: Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) [13:09:49] (03CR) 10Jgiannelos: [C: 03+1] tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [13:09:53] (03PS1) 10Filippo Giunchedi: hieradata: temp disable fallocate for thanos-swift [puppet] - 10https://gerrit.wikimedia.org/r/784258 (https://phabricator.wikimedia.org/T306424) [13:09:56] (03PS5) 10Elukey: role::ml_k8s::master: change the codfw svc/pod IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) [13:10:09] (03PS2) 10Elukey: Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) [13:10:29] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 251 bytes in 3.227 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:11:29] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P25329 and previous config saved to /var/cache/conftool/dbconfig/20220419-131128-marostegui.json [13:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:33] PROBLEM - SSH on wtp1035.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:11:53] (03PS13) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [13:12:19] (03PS1) 10Muehlenhoff: Don't prompt for loading additional firmware in d-i [puppet] - 
10https://gerrit.wikimedia.org/r/784259 (https://phabricator.wikimedia.org/T306148) [13:12:28] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [13:13:45] (JobUnavailable) firing: (2) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:14:27] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/784258 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:14:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P25330 and previous config saved to /var/cache/conftool/dbconfig/20220419-131444-ladsgroup.json [13:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:09] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - k [13:15:09] s-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kube [13:15:09] l-codfw, AS64607/IPv6: Active - 
kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:15:39] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Volans) [13:15:51] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1104.eqiad.wmnet with reason: Rebooting for T303174 [13:15:52] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1104.eqiad.wmnet with reason: Rebooting for T303174 [13:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1104 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25331 and previous config saved to /var/cache/conftool/dbconfig/20220419-131557-kormat.json [13:15:59] the above BGP alarms are due to maintenance to the ml codfw cluster [13:16:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:12] (03CR) 10Hnowlan: [C: 03+2] tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [13:16:33] PROBLEM - PyBal backends health check on lvs2010 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2007.codfw.wmnet, ml-serve2005.codfw.wmnet, ml-serve2008.codfw.wmnet, ml-serve2006.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:16:53] same for --^ [13:17:00] all expected, I stopped all nodes [13:17:57] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:18:04] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/784259 (https://phabricator.wikimedia.org/T306148) (owner: 
10Muehlenhoff) [13:18:45] (JobUnavailable) firing: (5) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:19:36] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/pcc-worker1003/1292/" [puppet] - 10https://gerrit.wikimedia.org/r/784258 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:20:01] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: / (spec from root) is CRITICAL: Test spec from root returned the unexpected status 503 (expecting: 200): /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [13:20:29] nn1l2: hello! looking at your patches now if you're still around [13:20:38] yes, thanks [13:21:07] (03Merged) 10jenkins-bot: tegola: increase memory limit further [deployment-charts] - 10https://gerrit.wikimedia.org/r/784257 (https://phabricator.wikimedia.org/T306424) (owner: 10Hnowlan) [13:21:51] !log kormat@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25332 and previous config saved to /var/cache/conftool/dbconfig/20220419-132151-kormat.json [13:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:03] (03CR) 10Majavah: [C: 03+2] mrwikisource: Add template editor and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784255 (https://phabricator.wikimedia.org/T269067) (owner: 104nn1l2) [13:22:13] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:22:23] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - 
kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv4: Active - k [13:22:23] s-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv4: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernet [13:22:23] dfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:22:46] (03Merged) 10jenkins-bot: mrwikisource: Add template editor and patroller user groups [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784255 (https://phabricator.wikimedia.org/T269067) (owner: 104nn1l2) [13:23:40] nn1l2: can you test on mwdebug1001 please? 
[13:23:46] ok [13:24:59] PROBLEM - PyBal backends health check on lvs2009 is CRITICAL: PYBAL CRITICAL - CRITICAL - inference_30443: Servers ml-serve2007.codfw.wmnet, ml-serve2005.codfw.wmnet, ml-serve2008.codfw.wmnet, ml-serve2006.codfw.wmnet are marked down but pooled: ml-ctrl_6443: Servers ml-serve-ctrl2001.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal [13:25:25] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [13:25:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:28] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [13:25:30] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [13:25:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:33] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [13:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:40] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: sync [13:25:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:05] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: sync [13:26:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:15] LGTM [13:26:20] syncing [13:26:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P25333 and previous config saved to /var/cache/conftool/dbconfig/20220419-132634-marostegui.json [13:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:47] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [13:26:50] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:57] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::master: change the codfw svc/pod IP ranges [puppet] - 10https://gerrit.wikimedia.org/r/776880 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:27:10] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [13:27:11] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.840 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:27:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:15] !log taavi@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: [[gerrit:784255|mrwikisource: Add template editor and patroller user groups (T269067)]] (duration: 00m 50s) [13:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:19] T269067: Add User Groups to mrwikisource, enable admins to add/remove these groups. - https://phabricator.wikimedia.org/T269067 [13:27:38] (03CR) 10Elukey: [C: 03+2] Change the Calico's pod IP subnet for ml-serve-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/776877 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:28:05] (03CR) 10Elukey: [C: 03+2] Change POD IPv4 subnet for ml-serve-codfw [homer/public] - 10https://gerrit.wikimedia.org/r/778208 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [13:28:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:28:58] anyone have anything else to deploy? [13:29:14] Thanks! 
[13:29:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141', diff saved to https://phabricator.wikimedia.org/P25334 and previous config saved to /var/cache/conftool/dbconfig/20220419-132949-ladsgroup.json [13:29:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db2080.codfw.wmnet with reason: Rebooting for T303174 [13:30:36] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db2080.codfw.wmnet with reason: Rebooting for T303174 [13:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:31:09] (03CR) 10MVernon: [C: 03+1] "LGTM; I think this will require a service restart to take effect?" [puppet] - 10https://gerrit.wikimedia.org/r/784258 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:31:41] (03PS14) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [13:32:14] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [13:33:17] (03PS15) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [13:33:52] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [13:34:15] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:35:01] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:35:05] PROBLEM - BGP status on 
cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:36:36] (03CR) 10Filippo Giunchedi: [C: 03+2] "Thanks! Correct, I'll do a roll-restart" [puppet] - 10https://gerrit.wikimedia.org/r/784258 (https://phabricator.wikimedia.org/T306424) (owner: 10Filippo Giunchedi) [13:36:55] !log kormat@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25335 and previous config saved to /var/cache/conftool/dbconfig/20220419-133655-kormat.json [13:36:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Volans) Adding @BGerdemann for approval (contract side), please also provide a contract end date. Adding @odimitrijevic for approval (analytics side). Adding @KFrancis f... 
[13:41:39] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T306269)', diff saved to https://phabricator.wikimedia.org/P25336 and previous config saved to /var/cache/conftool/dbconfig/20220419-134139-marostegui.json [13:41:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:41:45] T306269: Make primary key ipblocks.ipb_id unsigned on wmf wikis - https://phabricator.wikimedia.org/T306269 [13:44:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1141 (T298565)', diff saved to https://phabricator.wikimedia.org/P25337 and previous config saved to /var/cache/conftool/dbconfig/20220419-134455-ladsgroup.json [13:44:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:44:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1142.eqiad.wmnet with reason: Maintenance [13:44:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P25338 and previous config saved to /var/cache/conftool/dbconfig/20220419-134503-ladsgroup.json [13:45:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:46:17] !log hnowlan@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=eqiad 
[13:46:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:48:16] (03Abandoned) 10Bking: Revert "elasticsearch: upgrade eqiad to elasticsearch 6.8" [puppet] - 10https://gerrit.wikimedia.org/r/780640 (owner: 10Bking) [13:48:28] (03PS2) 10KartikMistry: Enable SectionTranslation in Test WP for ckb, el, eu, and zh-yue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784223 (https://phabricator.wikimedia.org/T304854) [13:49:59] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 4.052 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:50:01] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1169.eqiad.wmnet with reason: Rebooting for T303174 [13:50:02] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1169.eqiad.wmnet with reason: Rebooting for T303174 [13:50:03] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.759 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:50:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:08] !log kormat@cumin1001 dbctl commit (dc=all): 'db1169 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25339 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-135007-kormat.json [13:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:19] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.000 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:50:33] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.045 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:50:49] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.104 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:51:33] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1129.eqiad.wmnet with reason: Rebooting for T303174 [13:51:35] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1129.eqiad.wmnet with reason: Rebooting for T303174 [13:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:40] !log kormat@cumin1001 dbctl commit (dc=all): 'db1129 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25340 and previous config saved to /var/cache/conftool/dbconfig/20220419-135140-kormat.json [13:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:45] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:51:59] !log kormat@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25341 and previous config saved to /var/cache/conftool/dbconfig/20220419-135159-kormat.json [13:52:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:14] 10SRE, 10Traffic: Clean up Traffic Grafana 
dashboards to reflect HA-Proxy metrics - https://phabricator.wikimedia.org/T304153 (10MMandere) 05Open→03In progress [13:52:15] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:52:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1110.eqiad.wmnet with reason: Rebooting for T303174 [13:52:20] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1110.eqiad.wmnet with reason: Rebooting for T303174 [13:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] 10SRE, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Test haproxy as a WMF's CDN TLS terminator with real traffic - https://phabricator.wikimedia.org/T290005 (10MMandere) [13:52:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:25] !log kormat@cumin1001 dbctl commit (dc=all): 'db1110 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25342 and previous config saved to /var/cache/conftool/dbconfig/20220419-135225-kormat.json [13:52:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:29] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:53:35] PROBLEM - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:53:51] PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:54:01] PROBLEM 
- Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:54:05] PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:54:31] PROBLEM - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:54:33] PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:54:50] !log kormat@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25343 and previous config saved to /var/cache/conftool/dbconfig/20220419-135450-kormat.json [13:54:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:55] PROBLEM - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on tegola-vector-tiles.svc.codfw.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 5.580 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [13:55:02] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10mepps) I approve this request. 
[13:55:05] PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:55:33] PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:55:40] !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=eqiad [13:55:41] PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [13:55:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P25344 and previous config saved to /var/cache/conftool/dbconfig/20220419-135542-ladsgroup.json [13:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:55:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [13:56:33] !log kormat@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25345 and previous config saved to /var/cache/conftool/dbconfig/20220419-135632-kormat.json [13:56:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:59:59] RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 2.363 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:00:07] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:00:11] RECOVERY - Maps HTTPS on 
maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 5.436 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:00:17] RECOVERY - LVS kartotherian-ssl codfw port 443/tcp - Kartotherian- kartotherian.svc.codfw.wmnet - HTTPS IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.181 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:00:45] RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.541 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:00:49] RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.178 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:01:12] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1129.eqiad.wmnet with reason: Rebooting for T303174 [14:01:13] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1129.eqiad.wmnet with reason: Rebooting for T303174 [14:01:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:15] RECOVERY - LVS kartotherian codfw port 6533/tcp - Kartotherian- kartotherian.svc.codfw.wmnet IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:01:17] RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:29] RECOVERY - SSH on wtp1036.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:01:47] RECOVERY - LVS tegola-vector-tiles codfw port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.codfw.wmnet IPv4 on 
tegola-vector-tiles.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2365 bytes in 1.318 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:01:51] RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.187 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:02:53] RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:03:30] (PS1) Cathal Mooney: Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 [14:03:51] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:03:51] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:04:05] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:04:11] (CR) jerkins-bot: [V: -1] Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 (owner: Cathal Mooney) [14:04:27] PROBLEM - Maps HTTPS on maps1008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:04:30] !log kormat@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25346 and previous config saved to /var/cache/conftool/dbconfig/20220419-140430-kormat.json [14:04:33] Logged the message 
at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:45] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:05:32] (PS2) Cathal Mooney: Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 [14:05:41] PROBLEM - Maps HTTPS on maps1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:05:46] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2001.codfw.wmnet with OS bullseye [14:05:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:13] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.332 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:06:15] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:06:41] !log start deleting tegola-cache/osm prefix from tegola-swift-container - T306424 [14:06:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:45] (CR) jerkins-bot: [V: -1] Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 (owner: Cathal Mooney) [14:06:45] T306424: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 [14:07:03] !log kormat@cumin1001 dbctl commit (dc=all): 'db1104 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25347 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-140703-kormat.json [14:07:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:09] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:07:45] (PS3) Cathal Mooney: Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 [14:07:50] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1111.eqiad.wmnet with reason: Rebooting for T303174 [14:07:51] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1111.eqiad.wmnet with reason: Rebooting for T303174 [14:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:57] !log kormat@cumin1001 dbctl commit (dc=all): 'db1111 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25348 and previous config saved to /var/cache/conftool/dbconfig/20220419-140756-kormat.json [14:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:17] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 1.726 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:09:41] SRE, SRE-Access-Requests, Data-Engineering, Generated Data Platform: Request to grant cparle and mfossati login to an-airflow1003.eqiad.wmne - https://phabricator.wikimedia.org/T306057 (Ottomata) > I could just add them directly to the analytics-platform-eng-admins I'm going to choose this optio... 
[14:09:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25349 and previous config saved to /var/cache/conftool/dbconfig/20220419-140954-kormat.json [14:09:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] SRE, ops-codfw: codfw: Dedicate Rack B1 for cloudX-dev servers - https://phabricator.wikimedia.org/T305469 (Papaul) [14:10:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P25350 and previous config saved to /var/cache/conftool/dbconfig/20220419-141047-ladsgroup.json [14:10:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:37] !log kormat@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25351 and previous config saved to /var/cache/conftool/dbconfig/20220419-141136-kormat.json [14:11:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] !log kormat@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25352 and previous config saved to /var/cache/conftool/dbconfig/20220419-141146-kormat.json [14:11:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:45] RECOVERY - SSH on wtp1035.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [14:12:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:12:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1110.eqiad.wmnet with reason: Maintenance [14:12:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:02] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25353 and previous config saved to /var/cache/conftool/dbconfig/20220419-141303-ladsgroup.json [14:13:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:13:17] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:13:53] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.159 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:15:21] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:15:50] !log edited directly phab database to fix corrupt entry T305919 [14:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:54] T305919: De-link my aodit@wikimedia.org staff email from personal volunteer profile - https://phabricator.wikimedia.org/T305919 [14:16:19] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:24] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2002.codfw.wmnet with OS bullseye [14:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:23] !log elukey@cumin1001 
START - Cookbook sre.hosts.reimage for host ml-serve2003.codfw.wmnet with OS bullseye [14:17:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:34] !log kormat@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25354 and previous config saved to /var/cache/conftool/dbconfig/20220419-141933-kormat.json [14:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25355 and previous config saved to /var/cache/conftool/dbconfig/20220419-141937-ladsgroup.json [14:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:45] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:19:53] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 8.366 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [14:20:47] PROBLEM - Maps HTTPS on maps1007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:22:23] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [14:22:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:16] (CR) Volans: [C: +1] "I didn't test it but seems reasonable." 
[software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 (owner: Cathal Mooney) [14:23:33] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (AAssaf-WMF) [14:23:50] (PS16) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:24:26] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:24:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25356 and previous config saved to /var/cache/conftool/dbconfig/20220419-142457-kormat.json [14:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:19] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2004.codfw.wmnet with OS bullseye [14:25:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:47] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2001.codfw.wmnet with reason: host reimage [14:25:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142', diff saved to https://phabricator.wikimedia.org/P25357 and previous config saved to /var/cache/conftool/dbconfig/20220419-142552-ladsgroup.json [14:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:41] !log kormat@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25358 and previous config saved to /var/cache/conftool/dbconfig/20220419-142640-kormat.json [14:26:41] (PS17) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:26:43] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:26:51] !log kormat@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25359 and previous config saved to /var/cache/conftool/dbconfig/20220419-142650-kormat.json [14:26:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:11] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (jcrespo) @mepps could you provide confirmation that we should use Essex Wikimedia email account, and not gmail? We didn't get any response from him about this. [14:27:15] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:28:43] (PS18) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:29:41] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:31:28] (CR) Cathal Mooney: [C: +2] Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 (owner: Cathal Mooney) [14:31:56] (CR) Ottomata: "Ditto what Luca said. We can and should be able to run MirrorMaker anywhere. 
The fact that it is colocated with Kafka brokers right now " [puppet] - https://gerrit.wikimedia.org/r/779086 (https://phabricator.wikimedia.org/T305652) (owner: Herron) [14:32:05] (Merged) jenkins-bot: Update Netbox Move Server Script to Copy original Tagged Vlans [software/netbox-extras] - https://gerrit.wikimedia.org/r/784263 (owner: Cathal Mooney) [14:32:49] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:32:56] (CR) Ottomata: [C: +2] analytics: migrate clean_jupyter_user_local_trash cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/782339 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [14:33:07] (PS19) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:33:32] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [14:33:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:49] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:33:49] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [14:33:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:30] (CR) Ottomata: sre.kafka.reboot-workers: remove systemctl stop calls (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: Herron) [14:34:38] !log kormat@cumin1001 dbctl commit 
(dc=all): 'db1129 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25360 and previous config saved to /var/cache/conftool/dbconfig/20220419-143437-kormat.json [14:34:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:42] (CR) Ottomata: [C: +1] sre.kafka.reboot-workers: remove systemctl stop calls [cookbooks] - https://gerrit.wikimedia.org/r/778517 (https://phabricator.wikimedia.org/T305652) (owner: Herron) [14:34:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25361 and previous config saved to /var/cache/conftool/dbconfig/20220419-143444-ladsgroup.json [14:34:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:47] (PS2) Ottomata: analytics: remove absented clean_jupyter_user_local_trash cron [puppet] - https://gerrit.wikimedia.org/r/782340 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [14:36:04] (PS20) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:36:27] SRE, LDAP-Access-Requests: Grant Access to nda for jmads - https://phabricator.wikimedia.org/T306117 (Volans) Pending the related T249873 at this point, to do all together. 
[14:36:46] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:36:53] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2002.codfw.wmnet with reason: host reimage [14:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:10] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2001.codfw.wmnet with OS bullseye [14:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:50] (PS1) Ssingh: dnsdist: add support for retaining capabilites after startup [puppet] - https://gerrit.wikimedia.org/r/784270 [14:39:14] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2005.codfw.wmnet with OS bullseye [14:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:35] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34887/console" [puppet] - https://gerrit.wikimedia.org/r/784270 (owner: Ssingh) [14:39:51] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2003.codfw.wmnet with reason: host reimage [14:39:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:02] !log kormat@cumin1001 dbctl commit (dc=all): 'db1169 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25362 and previous config saved to /var/cache/conftool/dbconfig/20220419-144001-kormat.json [14:40:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:54] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation=create https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:40:58] 
!log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1142 (T298565)', diff saved to https://phabricator.wikimedia.org/P25363 and previous config saved to /var/cache/conftool/dbconfig/20220419-144057-ladsgroup.json [14:41:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [14:41:01] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1143.eqiad.wmnet with reason: Maintenance [14:41:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:41:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P25364 and previous config saved to /var/cache/conftool/dbconfig/20220419-144105-ladsgroup.json [14:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:24] (CR) BPirkle: [C: +1] "Approved for self-merge and deploy" [deployment-charts] - https://gerrit.wikimedia.org/r/767080 (https://phabricator.wikimedia.org/T300914) (owner: Hnowlan) [14:41:45] !log kormat@cumin1001 dbctl commit (dc=all): 'db1110 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25365 and previous config saved to /var/cache/conftool/dbconfig/20220419-144144-kormat.json [14:41:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:54] RECOVERY - etcd request latencies on kubemaster1002 
is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/d/000000435/kubernetes-api?orgId=1&viewPanel=28 [14:41:54] !log kormat@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25366 and previous config saved to /var/cache/conftool/dbconfig/20220419-144154-kormat.json [14:41:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:06] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [14:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:31] (PS1) Ssingh: dnsdist: add CAP_BPF to systemd override for eBPF support [puppet] - https://gerrit.wikimedia.org/r/784272 [14:43:22] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34888/console" [puppet] - https://gerrit.wikimedia.org/r/784272 (owner: Ssingh) [14:45:04] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (eigyan) [14:45:41] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2004.codfw.wmnet with reason: host reimage [14:45:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:44] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (eigyan) Greetings @jcrespo sorry about the mixup, please use my eigyan@wikimedia.org account. I have updated the ticket with the same. 
[14:46:44] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (jcrespo) Stalled→Open [14:46:55] SRE, Traffic, Patch-For-Review: Improve handling/logging of HAproxy emergency log messages - https://phabricator.wikimedia.org/T306236 (Vgutierrez) So I was considering a third approach, parsing the termination_state field from HAProxy request log, but it won't give the exact issue (PC and RC seems t... [14:47:06] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (eigyan) a:mepps→Dzahn [14:48:14] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 3.236 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:48:16] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2002.codfw.wmnet with OS bullseye [14:48:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:30] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:48:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1157.eqiad.wmnet with reason: Maintenance [14:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25367 and previous config saved to /var/cache/conftool/dbconfig/20220419-144836-ladsgroup.json [14:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, 
user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [14:49:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db1129 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25368 and previous config saved to /var/cache/conftool/dbconfig/20220419-144941-kormat.json [14:49:42] (CR) Ottomata: [C: +2] analytics: remove absented clean_jupyter_user_local_trash cron [puppet] - https://gerrit.wikimedia.org/r/782340 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [14:49:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:45] (PS21) Vivian Rook: pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 [14:49:46] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (jcrespo) [14:49:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110', diff saved to https://phabricator.wikimedia.org/P25369 and previous config saved to /var/cache/conftool/dbconfig/20220419-144949-ladsgroup.json [14:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:53] (CR) Ottomata: "Thank you!" [puppet] - https://gerrit.wikimedia.org/r/782340 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [14:50:19] (CR) jerkins-bot: [V: -1] pcc commit do not merge [puppet] - https://gerrit.wikimedia.org/r/782107 (owner: Vivian Rook) [14:51:13] (PS2) Ssingh: dnsdist: add support for retaining capabilites after startup [puppet] - https://gerrit.wikimedia.org/r/784270 [14:51:29] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (jcrespo) Waiting now for @Ottomata approval (Data Engineering). 
[14:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff saved to https://phabricator.wikimedia.org/P25370 and previous config saved to /var/cache/conftool/dbconfig/20220419-145143-ladsgroup.json [14:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:01] (CR) Ssingh: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/34890/console" [puppet] - https://gerrit.wikimedia.org/r/784270 (owner: Ssingh) [14:52:12] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (Dzahn) a:Dzahn→None Thanks all! Giving this ticket back to the pool but don't worry, it will be handled soon. [14:52:26] PROBLEM - Maps HTTPS on maps1010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [14:52:29] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2003.codfw.wmnet with OS bullseye [14:52:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:14] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2006.codfw.wmnet with OS bullseye [14:54:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:32] (CR) Ssingh: [V: +1] "We won't be merging this till the Wikidough hosts are reimaged to bullseye but this is ready for review nevertheless." 
[puppet] - https://gerrit.wikimedia.org/r/784272 (owner: Ssingh) [14:54:33] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2007.codfw.wmnet with OS bullseye [14:54:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:53] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [14:54:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:33] (CR) Ssingh: [V: +1] "The current capability set from dnsdist.service on doh1001:" [puppet] - https://gerrit.wikimedia.org/r/784272 (owner: Ssingh) [14:56:51] SRE-swift-storage, Data-Persistence, Maps, Product-Infrastructure-Team-Backlog, Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (fgiunchedi) I have temporarily disabled `fallocate` in thanos-swift with https://gerri... [14:56:58] !log kormat@cumin1001 dbctl commit (dc=all): 'db1111 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25371 and previous config saved to /var/cache/conftool/dbconfig/20220419-145658-kormat.json [14:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:29] ops-codfw: mc2031.mgmt looks down from icinga's perspective - https://phabricator.wikimedia.org/T306438 (elukey) [14:58:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2005.codfw.wmnet with reason: host reimage [14:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:45] (JobUnavailable) firing: (6) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:55] 
(NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [15:02:01] (03PS22) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [15:02:35] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [15:03:03] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2004.codfw.wmnet with OS bullseye [15:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2009.codfw.wmnet [15:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:36] !log elukey@cumin1001 START - Cookbook sre.hosts.reimage for host ml-serve2008.codfw.wmnet with OS bullseye [15:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:04:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1110 (T298565)', diff saved to https://phabricator.wikimedia.org/P25372 and previous config saved to /var/cache/conftool/dbconfig/20220419-150454-ladsgroup.json [15:04:57] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:04:58] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [15:04:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:00] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, 
user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:37] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/784259 (https://phabricator.wikimedia.org/T306148) (owner: 10Muehlenhoff) [15:06:30] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1114.eqiad.wmnet with reason: Rebooting for T303174 [15:06:32] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1114.eqiad.wmnet with reason: Rebooting for T303174 [15:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:38] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25373 and previous config saved to /var/cache/conftool/dbconfig/20220419-150637-kormat.json [15:06:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:06:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P25374 and previous config saved to /var/cache/conftool/dbconfig/20220419-150649-ladsgroup.json [15:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:10] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1182.eqiad.wmnet with reason: Rebooting for T303174 [15:07:12] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1182.eqiad.wmnet with reason: Rebooting for T303174 [15:07:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:16] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:17] !log kormat@cumin1001 dbctl commit (dc=all): 'db1182 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25375 and previous config saved to /var/cache/conftool/dbconfig/20220419-150717-kormat.json [15:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2009.codfw.wmnet [15:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:37] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10Ottomata) Approved [15:08:41] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.033 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:09:13] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3065 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:09:15] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2010.codfw.wmnet [15:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:31] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10jcrespo) p:05Triage→03High [15:09:35] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [15:09:37] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with 
reason: Maintenance [15:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [15:09:40] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [15:09:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [15:09:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:12] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [15:10:14] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2005.codfw.wmnet with OS bullseye [15:10:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:20] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25376 and previous config saved to /var/cache/conftool/dbconfig/20220419-151019-kormat.json [15:10:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:54] (03CR) 10Volans: "For context in production the oldest version of util-linux is 2.29.2-1+deb9u1 (see https://debmonitor.wikimedia.org/packages/util-linux)." 
[puppet] - 10https://gerrit.wikimedia.org/r/780907 (owner: 10JHathaway) [15:11:03] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.08065 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:11:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:11:34] (03CR) 10Cwhite: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:11:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [15:11:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:08] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2006.codfw.wmnet with reason: host reimage [15:13:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:44] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1027.eqiad.wmnet with reason: Rebooting for T303174 [15:15:45] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1027.eqiad.wmnet with reason: Rebooting for T303174 [15:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:52] !log kormat@cumin1001 dbctl commit (dc=all): 'es1027 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25377 and previous config saved to /var/cache/conftool/dbconfig/20220419-151552-kormat.json [15:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:53] !log elukey@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2007.codfw.wmnet with reason: host reimage [15:15:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [15:16:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [15:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25378 and previous config saved to /var/cache/conftool/dbconfig/20220419-151607-ladsgroup.json [15:16:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:17:49] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1001.eqiad.wmnet [15:17:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:51] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wdqs2010.codfw.wmnet [15:17:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2011.codfw.wmnet [15:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:47] !log ladsgroup@cumin1001 dbctl commit 
(dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25379 and previous config saved to /var/cache/conftool/dbconfig/20220419-151847-ladsgroup.json [15:18:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:11] !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [15:19:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:20:36] (03CR) 10Volans: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:21:18] !log kormat@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25380 and previous config saved to /var/cache/conftool/dbconfig/20220419-152117-kormat.json [15:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143', diff saved to https://phabricator.wikimedia.org/P25381 and previous config saved to /var/cache/conftool/dbconfig/20220419-152154-ladsgroup.json [15:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:14] !log ariel@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dumpsdata1001.eqiad.wmnet [15:22:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:52] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:23:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:23:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:24:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:24:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:16] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [15:24:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:24:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:33] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:45] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-serve2008.codfw.wmnet with reason: host reimage [15:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:50] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:24:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:56] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2006.codfw.wmnet with OS bullseye [15:24:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:01] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:25:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:10] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[15:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:24] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25382 and previous config saved to /var/cache/conftool/dbconfig/20220419-152523-kormat.json [15:25:25] !log ariel@cumin1001 START - Cookbook sre.hosts.reboot-single for host dumpsdata1003.eqiad.wmnet [15:25:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:22] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:54] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:58] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:27:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:20] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2007.codfw.wmnet with OS bullseye [15:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:39] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:28:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:45] (JobUnavailable) resolved: (3) Reduced availability for job calico-felix in k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:29:31] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host wdqs2011.codfw.wmnet [15:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:38] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 3.628 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:29:44] (03PS1) 10Elukey: Change coredns IP for ml-serve-codfw after cluster re-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/784275 (https://phabricator.wikimedia.org/T304673) [15:33:29] !log start rdb2008 from mgmt console (was powered down for relocation) [15:33:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:54] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:34:14] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 2.773 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:35:09] (03CR) 10Cwhite: smart_data_dump: skip 
over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:35:10] !log andrew@cumin1001 START - Cookbook sre.hosts.reimage for host cloudvirt1019.eqiad.wmnet with OS bullseye [15:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 7.636 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:35:22] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 94, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:35:30] !log ariel@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dumpsdata1003.eqiad.wmnet [15:35:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] RECOVERY - Host rdb2008 is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [15:36:22] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 127, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:36:22] !log kormat@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25383 and previous config saved to /var/cache/conftool/dbconfig/20220419-153621-kormat.json [15:36:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:48] !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-serve2008.codfw.wmnet with OS bullseye [15:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:56] PROBLEM - Check health of redis instance on 6378 on rdb2008 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6378 https://wikitech.wikimedia.org/wiki/Redis [15:37:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1143 (T298565)', diff 
saved to https://phabricator.wikimedia.org/P25384 and previous config saved to /var/cache/conftool/dbconfig/20220419-153659-ladsgroup.json [15:37:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:37:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [15:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:04] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:37:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25385 and previous config saved to /var/cache/conftool/dbconfig/20220419-153707-ladsgroup.json [15:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:34] !log otto@deploy1002 Started deploy [analytics/refinery@f136555]: weekly train [15:37:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:09] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) 1. Good to know, I thought that we needed manual intervention to create ne... 
[15:39:03] (03CR) 10JHathaway: smart_data_dump: Use lsblk's json output (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780907 (owner: 10JHathaway) [15:39:06] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [15:39:11] (03CR) 10JHathaway: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:39:19] !log powercycle elastic1097 (still with role::insetup, but not reachable via ssh or mgmt console) [15:39:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:18] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [15:40:20] RECOVERY - Check health of redis instance on 6378 on rdb2008 is OK: OK: REDIS 6.0.14 on 127.0.0.1:6378 has 1 databases (db0) with 18775664 keys, up 4 minutes 40 seconds https://wikitech.wikimedia.org/wiki/Redis [15:40:28] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25386 and previous config saved to /var/cache/conftool/dbconfig/20220419-154027-kormat.json [15:40:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:33] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Papaul) [15:42:46] (03CR) 10Cwhite: grafana: provision JSON datasource (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/774380 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [15:43:21] (03CR) 10Volans: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:44:50] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) Regarding next steps: Currently we have an interim swift container just t... [15:45:48] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10elukey) 05Resolved→03Open Hi! I see the following when rebooting elastic1097: ` UEFI0058: Uncorrectable Memory Error has occurred becau... [15:47:05] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: host reimage [15:47:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:32] (03CR) 10Elukey: [C: 03+2] Change coredns IP for ml-serve-codfw after cluster re-init [deployment-charts] - 10https://gerrit.wikimedia.org/r/784275 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [15:48:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25387 and previous config saved to /var/cache/conftool/dbconfig/20220419-154806-ladsgroup.json [15:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:11] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [15:48:12] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephmon2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [15:48:14] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25388 and previous config saved to /var/cache/conftool/dbconfig/20220419-154850-ladsgroup.json [15:48:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:03] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1019.eqiad.wmnet with reason: host reimage [15:50:04] (03PS1) 10Jcrespo: admin: Add Essex Igyan access to analytics-privetedata-users group [puppet] - 10https://gerrit.wikimedia.org/r/784279 (https://phabricator.wikimedia.org/T305948) [15:50:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:06] (03PS23) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [15:50:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:50:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:50] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [15:51:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:51:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:16] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[15:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:26] !log kormat@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25389 and previous config saved to /var/cache/conftool/dbconfig/20220419-155125-kormat.json [15:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:40] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T303174 [15:51:41] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on es1026.eqiad.wmnet with reason: Rebooting for T303174 [15:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:47] !log kormat@cumin1001 dbctl commit (dc=all): 'es1026 depooling: Rebooting for T303174', diff saved to https://phabricator.wikimedia.org/P25390 and previous config saved to /var/cache/conftool/dbconfig/20220419-155146-kormat.json [15:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:52:13] (03CR) 10Jcrespo: "Hey, Riccardo- on your own time, can you review this and merge it yourself to close T305948? Only merging it is pending and I guess you ma" [puppet] - 10https://gerrit.wikimedia.org/r/784279 (https://phabricator.wikimedia.org/T305948) (owner: 10Jcrespo) [15:52:38] (03CR) 10JHathaway: [C: 03+2] smart_data_dump: Use lsblk's json output [puppet] - 10https://gerrit.wikimedia.org/r/780907 (owner: 10JHathaway) [15:52:51] (03CR) 10Cwhite: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [15:54:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[15:54:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:31] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:54:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:54:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:42] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:54:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:15] (03CR) 10Volans: [C: 03+2] "LGTM, thanks for the patch @jcrespo." [puppet] - 10https://gerrit.wikimedia.org/r/784279 (https://phabricator.wikimedia.org/T305948) (owner: 10Jcrespo) [15:55:31] !log kormat@cumin1001 dbctl commit (dc=all): 'db1114 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25391 and previous config saved to /var/cache/conftool/dbconfig/20220419-155531-kormat.json [15:55:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:36] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [15:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:39] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [15:57:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:46] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[15:57:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:50] (Primary inbound port utilisation over 80% #page) firing: Alert for device cr2-eqord.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [15:58:01] woop [15:58:08] 👋 [15:58:11] 👋 [15:58:20] here [15:58:30] here [15:58:42] here [15:58:53] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10Volans) @eigyan the access request has been merged, it will be deployed within the next 30 minutes. Please resolve this task once confirmed... [15:58:56] it's the eqord/eqiad transport link [15:59:01] it's been hot for hours [15:59:05] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to analytics-privatedata-users for Essex Igyan eigyan - https://phabricator.wikimedia.org/T305948 (10jcrespo) @eigyan No harm done!, it was just that when I first saw the gmail account and no name, I thought it was a volunteer asking for acc...
[15:59:15] looks to correlate with this also https://librenms.wikimedia.org/graphs/to=1650383700/id=16841/type=port_bits/from=1650297300/ [15:59:34] wow [15:59:36] (03CR) 10Cwhite: [C: 03+1] sre: add alerts for exporter-specific unavailability (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/778259 (https://phabricator.wikimedia.org/T288726) (owner: 10Filippo Giunchedi) [15:59:39] yep, out to ntt [15:59:44] * akosiaris around [15:59:55] to AS16509 [15:59:55] !log otto@deploy1002 Finished deploy [analytics/refinery@f136555]: weekly train (duration: 22m 21s) [15:59:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:05] jbond and rzl: Dear deployers, time to do the Puppet request window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:25] the read requests haven't changed. Upload scraper? [16:00:36] probably [16:00:42] no [16:00:44] labstore1006. [16:00:57] no specific IP but something in the 53.72.0.0/13 range stands out [16:01:00] https://w.wiki/556V [16:01:07] ? [16:01:20] akosiaris: the source of traffic is labstore1006 [16:01:22] 1006 is dumps, right? [16:01:25] dumps? [16:01:52] ok, that's.. unexpected [16:02:02] we have ratelimiting to dumps [16:02:15] someone probably is bypassing it [16:02:23] Amir1: if it is per-IP, ... [16:02:25] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for AAssaf - https://phabricator.wikimedia.org/T306437 (10Volans) @dr0ptp4kt could you please clarify if this access request (and the other related to the same project) is instead for the NDA group more than the WMF one? The NDA seems more appropriate for... [16:02:26] don't we have... a better way to connect to AWS than via transit in ord? [16:02:42] (03PS1) 10Jgiannelos: tegola: Point to codfw s3 endpoint for debugging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/784284 [16:02:49] * volans here too [16:02:50] (Primary inbound port utilisation over 80% #page) resolved: Device cr2-eqord.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:03:02] I did get that page but I also called in sick. seeing there are others here and it's labstore1006 ... and it resolved. ok [16:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P25392 and previous config saved to /var/cache/conftool/dbconfig/20220419-160311-ladsgroup.json [16:03:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:16] mutante: go rest, we got it <3 [16:03:33] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.202 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:03:39] does look like it's coming down again [16:03:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25393 and previous config saved to /var/cache/conftool/dbconfig/20220419-160355-ladsgroup.json [16:03:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:04:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [16:04:05] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25394 and previous config saved to /var/cache/conftool/dbconfig/20220419-160409-ladsgroup.json [16:04:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:14] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:05:24] (03CR) 10Hnowlan: [C: 03+1] tegola: Point to codfw s3 endpoint for debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/784284 (owner: 10Jgiannelos) [16:06:02] (03PS1) 10MusikAnimal: DeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) [16:06:30] !log kormat@cumin1001 dbctl commit (dc=all): 'es1027 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25395 and previous config saved to /var/cache/conftool/dbconfig/20220419-160629-kormat.json [16:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:18] !log kormat@cumin1001 START - Cookbook sre.hosts.downtime for 1:30:00 on db1182.eqiad.wmnet with reason: Rebooting for T303174 [16:07:19] !log kormat@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:30:00 on db1182.eqiad.wmnet with reason: Rebooting for T303174 [16:07:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:07:45] (Primary outbound port utilisation over 80% 
#page) firing: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:07:49] (03CR) 10Jgiannelos: [C: 03+2] tegola: Point to codfw s3 endpoint for debugging. [deployment-charts] - 10https://gerrit.wikimedia.org/r/784284 (owner: 10Jgiannelos) [16:07:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [16:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:17] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:08:47] !log otto@deploy1002 Started deploy [analytics/refinery@f136555] (thin): Regular analytics weekly train THIN [analytics/refinery@f136555] [16:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:08:55] !log otto@deploy1002 Finished deploy [analytics/refinery@f136555] (thin): Regular analytics weekly train THIN [analytics/refinery@f136555] (duration: 00m 07s) [16:08:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:30] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1019.eqiad.wmnet with OS bullseye [16:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:43] !log otto@deploy1002 Started deploy [analytics/refinery@f136555] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f136555] [16:09:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling 
after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25396 and previous config saved to /var/cache/conftool/dbconfig/20220419-160948-ladsgroup.json [16:09:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:09:52] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:11:30] !log kormat@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 25%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25397 and previous config saved to /var/cache/conftool/dbconfig/20220419-161129-kormat.json [16:11:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:05] (03Abandoned) 10MusikAnimal: DeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) (owner: 10MusikAnimal) [16:13:23] (03Merged) 10jenkins-bot: tegola: Point to codfw s3 endpoint for debugging. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/784284 (owner: 10Jgiannelos) [16:13:44] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:02] (03Restored) 10MusikAnimal: DeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) (owner: 10MusikAnimal) [16:14:34] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [16:14:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:59] (03PS2) 10MusikAnimal: DeletePage, UndeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) [16:15:10] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [16:15:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:32] !log otto@deploy1002 Finished deploy [analytics/refinery@f136555] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@f136555] (duration: 06m 49s) [16:16:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:45] (Primary outbound port utilisation over 80% #page) resolved: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:18:17] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314', diff saved to https://phabricator.wikimedia.org/P25398 and previous config saved to /var/cache/conftool/dbconfig/20220419-161816-ladsgroup.json [16:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157', diff saved to https://phabricator.wikimedia.org/P25399 and previous config saved to /var/cache/conftool/dbconfig/20220419-161901-ladsgroup.json [16:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:19:58] (03PS1) 10Ladsgroup: dumps: Block python requests UA [puppet] - 10https://gerrit.wikimedia.org/r/784288 [16:21:25] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 4.353 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:21:51] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=cswiki --delete # T304461 [16:21:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:21:55] T304461: Delete `growthexperiments-mentor-id` properties from user_properties - https://phabricator.wikimedia.org/T304461 [16:22:53] RECOVERY - Maps HTTPS on maps1008 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 7.555 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:23:14] (03CR) 10JHathaway: [C: 03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/784288 (owner: 10Ladsgroup) [16:23:40] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=kowiki --delete # T304461 [16:23:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:44] (03CR) 10Ladsgroup: [C: 03+2] dumps: Block python 
requests UA [puppet] - 10https://gerrit.wikimedia.org/r/784288 (owner: 10Ladsgroup) [16:24:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25400 and previous config saved to /var/cache/conftool/dbconfig/20220419-162453-ladsgroup.json [16:24:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:05] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:26:34] !log kormat@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 50%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25401 and previous config saved to /var/cache/conftool/dbconfig/20220419-162633-kormat.json [16:26:37] (03PS1) 10Jgiannelos: Revert "tegola: Point to codfw s3 endpoint for debugging." [deployment-charts] - 10https://gerrit.wikimedia.org/r/783912 [16:26:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:09] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 2.597 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:27:54] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:27:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:03] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:28:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host wdqs2012.codfw.wmnet [16:28:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:45] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [16:28:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:57] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [16:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [16:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host wdqs2012.codfw.wmnet [16:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25402 and previous config saved to /var/cache/conftool/dbconfig/20220419-163321-ladsgroup.json [16:33:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:33:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [16:33:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:26] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:33:29] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1157 (T298565)', diff saved to https://phabricator.wikimedia.org/P25403 and previous config saved to /var/cache/conftool/dbconfig/20220419-163406-ladsgroup.json [16:34:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:34:09] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [16:34:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25404 and previous config saved to /var/cache/conftool/dbconfig/20220419-163414-ladsgroup.json [16:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:59] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.004 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [16:37:13] (KubernetesRsyslogDown) firing: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [16:38:49] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. 
[16:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:59] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25405 and previous config saved to /var/cache/conftool/dbconfig/20220419-163958-ladsgroup.json [16:40:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:41:38] !log kormat@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25406 and previous config saved to /var/cache/conftool/dbconfig/20220419-164137-kormat.json [16:41:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:42:11] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudcephmon2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [16:42:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1148.eqiad.wmnet with reason: Maintenance [16:42:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1148 (T298565)', diff saved to https://phabricator.wikimedia.org/P25407 and previous config saved to /var/cache/conftool/dbconfig/20220419-164216-ladsgroup.json [16:42:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:42:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on 
wmf wikis - https://phabricator.wikimedia.org/T298565 [16:46:23] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [16:46:57] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10fgiunchedi) >>! In T306424#7864446, @Jgiannelos wrote: > 1. Good to know, I thought th... [16:47:44] (03PS1) 10Nskaggs: dumps: Add email to UA block [puppet] - 10https://gerrit.wikimedia.org/r/784290 [16:48:42] (03CR) 10Andrew Bogott: [C: 03+2] dumps: Add email to UA block [puppet] - 10https://gerrit.wikimedia.org/r/784290 (owner: 10Nskaggs) [16:50:01] RECOVERY - Maps HTTPS on maps1005 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.848 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [16:50:50] 10ops-eqiad, 10DC-Ops: hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10RobH) [16:51:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25408 and previous config saved to /var/cache/conftool/dbconfig/20220419-165150-ladsgroup.json [16:51:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:51:55] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [16:51:57] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat 
https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:52:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): Q3:(Need By: TBD) rack/setup/install elastic1089-1102 - https://phabricator.wikimedia.org/T299609 (10RobH) 05Open→03Resolved >>! In T299609#7864485, @elukey wrote: > Hi! > > I see the following when rebooting elastic1097: > > ` > UEFI00... [16:53:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [16:53:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:06] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [16:53:07] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [16:53:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:11] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [16:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298565)', diff saved to https://phabricator.wikimedia.org/P25409 and previous config saved to /var/cache/conftool/dbconfig/20220419-165311-ladsgroup.json [16:53:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25410 and previous config saved to /var/cache/conftool/dbconfig/20220419-165503-ladsgroup.json [16:55:05] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [16:55:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with 
reason: Maintenance [16:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25411 and previous config saved to /var/cache/conftool/dbconfig/20220419-165511-ladsgroup.json [16:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:56:24] (03CR) 10Cwhite: [C: 03+1] "LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/782359 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [16:56:27] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:56:42] !log kormat@cumin1001 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: Reboot T303174', diff saved to https://phabricator.wikimedia.org/P25412 and previous config saved to /var/cache/conftool/dbconfig/20220419-165641-kormat.json [16:56:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:00] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10RobH) a:03Cmjohnson [16:57:04] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) The problem with starting a new container from scratch is that we rely on... 
[16:57:10] 10ops-eqiad, 10DC-Ops, 10Discovery-Search (Current work): hw troubleshooting: memory error for elastic1097 - https://phabricator.wikimedia.org/T306449 (10RobH) This is a newly racked host so this could just require reseating to clear it up, as the memory can unseat during shipment. If reseating doesn't fix... [16:57:57] (Outbound discards) firing: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [16:59:17] RECOVERY - Maps HTTPS on maps1007 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 8.945 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:00:29] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:00:53] RECOVERY - Maps HTTPS on maps1010 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 8.966 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:02:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25413 and previous config saved to /var/cache/conftool/dbconfig/20220419-170202-ladsgroup.json [17:02:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:02:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:02:45] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudcephmon2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:02:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:07] RECOVERY - 
LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 9.931 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:04:50] 10SRE-swift-storage, 10Data-Persistence, 10Maps, 10Product-Infrastructure-Team-Backlog, 10Patch-For-Review: Tegola pods are crashing because swift doesnt allow connections - https://phabricator.wikimedia.org/T306424 (10Jgiannelos) Is it an option to bootstrap a new container from backups ? [17:06:15] 10SRE-swift-storage: Test Commons doesn't show any images - https://phabricator.wikimedia.org/T306139 (10Ladsgroup) My hypothesis: The normalization script works for every domain except *.wikimedia.org for lots of reasons and that's why it bit us. we do have a lot of wikis with - in their langcode (zh-classical... [17:06:46] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10dom_walden) This is happening again. I am also seeing: ` Request from 52.225.87.246 via deployment-cache-text06 d... [17:06:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25414 and previous config saved to /var/cache/conftool/dbconfig/20220419-170655-ladsgroup.json [17:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:55] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[17:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:58] (Outbound discards) resolved: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Outbound discards - https://alerts.wikimedia.org/?q=alertname%3DOutbound+discards [17:08:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P25415 and previous config saved to /var/cache/conftool/dbconfig/20220419-170816-ladsgroup.json [17:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:43] (03PS5) 10Juan90264: Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) [17:10:46] (03CR) 10jerkins-bot: [V: 04-1] Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) (owner: 10Juan90264) [17:11:19] (03CR) 10jerkins-bot: [V: 04-1] Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) (owner: 10Juan90264) [17:11:25] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [17:11:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:59] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudnet2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:12:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:12:03] (03CR) 10Dmaza: [C: 03+1] DeletePage, UndeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) (owner: 10MusikAnimal) [17:13:46] (03PS6) 10Juan90264: Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) [17:14:27] !log elukey@deploy1002 helmfile 
[ml-serve-codfw] START helmfile.d/admin 'sync'. [17:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:19] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.005 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:17:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25416 and previous config saved to /var/cache/conftool/dbconfig/20220419-171707-ladsgroup.json [17:17:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:18:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:51] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Agree how to handle port-block speeds for QFX5120-48Y - https://phabricator.wikimedia.org/T303529 (10cmooney) So to confirm it the configuration detailed above does not work: ` mooney@cloudsw1-e4-eqiad> show configuration chassis | display... [17:18:55] I guess I'll ask about https://gerrit.wikimedia.org/r/783911 since wikibugs logged it here... merging that means it'll go out with wmf.8, right, since it isn't live yet? Dmaza apparently doesn't have +2 rights to that patch, so hoping someone else can +2 it for me. Otherwise we'll backport it later [17:19:51] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10cmooney) @Jclark-ctr that's great. I've been able to finish off the testing. 
Feel free to remove those cables and close off this task. Thanks :) [17:19:59] nemo-yiannis: re: tegola and swift, I have to go now but let's resume tomorrow (to quickly answer your question, there's no backups of that data no) [17:20:07] ok [17:20:13] 10SRE, 10Beta-Cluster-Infrastructure, 10Traffic, 10Beta-Cluster-reproducible: Beta cluster down: Error: 502, Next Hop Connection Failed (Feb 2022) - https://phabricator.wikimedia.org/T302699 (10Zabe) ` zabe@deployment-mediawiki12:~$ sudo tail /var/log/apache2.log Apr 19 17:13:55 deployment-mediawiki12 apac... [17:20:14] lets resume tomorrow [17:21:54] (03PS1) 10Phedenskog: grafana: Fix performance team JSON proxy. [puppet] - 10https://gerrit.wikimedia.org/r/784294 (https://phabricator.wikimedia.org/T304583) [17:22:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25417 and previous config saved to /var/cache/conftool/dbconfig/20220419-172200-ladsgroup.json [17:22:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148', diff saved to https://phabricator.wikimedia.org/P25418 and previous config saved to /var/cache/conftool/dbconfig/20220419-172321-ladsgroup.json [17:23:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:00] (03PS1) 10Jeena Huneidi: testwikis wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784295 [17:25:02] (03CR) 10Jeena Huneidi: [C: 03+2] testwikis wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784295 (owner: 10Jeena Huneidi) [17:25:19] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 3.547 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:25:47] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- 
kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 7.201 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:26:11] (03Merged) 10jenkins-bot: testwikis wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784295 (owner: 10Jeena Huneidi) [17:28:03] (03PS1) 10Elukey: role::ml_k8s::{master,worker}: update coredns IP [puppet] - 10https://gerrit.wikimedia.org/r/784296 (https://phabricator.wikimedia.org/T304673) [17:29:02] (03CR) 10Elukey: [C: 03+2] role::ml_k8s::{master,worker}: update coredns IP [puppet] - 10https://gerrit.wikimedia.org/r/784296 (https://phabricator.wikimedia.org/T304673) (owner: 10Elukey) [17:31:17] (03CR) 10Btullis: [C: 03+1] zookeeper: migrate zookeeper-cleanup cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/777451 (https://phabricator.wikimedia.org/T273673) (owner: 10Zabe) [17:31:47] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [17:31:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:08] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:32:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25419 and previous config saved to /var/cache/conftool/dbconfig/20220419-173212-ladsgroup.json [17:32:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:15] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[17:32:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:21] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [17:32:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:32:45] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:33:26] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [17:33:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:29] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [17:33:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [17:33:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [17:33:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [17:33:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:10] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=bnwiki --delete # T304461 [17:36:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[17:36:16] T304461: Delete `growthexperiments-mentor-id` properties from user_properties - https://phabricator.wikimedia.org/T304461 [17:37:02] 10SRE, 10ops-codfw: Dell switches testing - https://phabricator.wikimedia.org/T290133 (10Papaul) Step 3 connect 2 links 1 from dell-spine1 to Juniper QFX switch and another one from dell-spine2 to Juniper QFX switch as well. lsw3 et/0/0/50 dell-spine1 Ethernet 104 lsw3 et/0/0/52 dell-spine2 Ethernet 104 [17:37:06] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25420 and previous config saved to /var/cache/conftool/dbconfig/20220419-173706-ladsgroup.json [17:37:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:10] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:37:56] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [17:37:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:06] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=arwiki --delete # T304461 [17:38:06] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:12] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[17:38:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1148 (T298565)', diff saved to https://phabricator.wikimedia.org/P25421 and previous config saved to /var/cache/conftool/dbconfig/20220419-173827-ladsgroup.json [17:38:29] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:38:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1149.eqiad.wmnet with reason: Maintenance [17:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25422 and previous config saved to /var/cache/conftool/dbconfig/20220419-173836-ladsgroup.json [17:38:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:44] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:38:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:52] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. 
[17:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudnet2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [17:39:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:09] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [17:39:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:17] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:09] RECOVERY - PyBal backends health check on lvs2009 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:42:33] RECOVERY - PyBal backends health check on lvs2010 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal [17:45:59] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 2.756 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:47:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25423 and previous config saved to /var/cache/conftool/dbconfig/20220419-174717-ladsgroup.json [17:47:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:47:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1161.eqiad.wmnet with reason: Maintenance [17:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 
clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:47:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:47:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25424 and previous config saved to /var/cache/conftool/dbconfig/20220419-174731-ladsgroup.json [17:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:50:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25425 and previous config saved to /var/cache/conftool/dbconfig/20220419-175021-ladsgroup.json [17:50:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:51] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- 
tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.046 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [17:54:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:54:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1179.eqiad.wmnet with reason: Maintenance [17:54:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:32] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25426 and previous config saved to /var/cache/conftool/dbconfig/20220419-175431-ladsgroup.json [17:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:36] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [17:56:41] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [18:00:05] jeena and brennen: Your horoscope predicts another unfortunate MediaWiki train - Utc-7 Version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T1800). 
[18:03:33] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudnet2006-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:03:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:04:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:04] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:04:05] train is blocked, also investigating some errors with deployment tooling [18:04:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:04:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:04:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:24] o/ [18:04:26] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudservices2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:15] !log train 1.38.0-wmf.9 (T305214): we're currently debugging some scap / train prep issues. 
[18:05:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:18] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [18:05:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P25427 and previous config saved to /var/cache/conftool/dbconfig/20220419-180525-ladsgroup.json [18:05:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:05] 10ops-eqiad: Port with no description on access switch - https://phabricator.wikimedia.org/T306129 (10wiki_willy) a:03Cmjohnson [18:10:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25428 and previous config saved to /var/cache/conftool/dbconfig/20220419-181047-ladsgroup.json [18:10:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:53] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:15:41] 10SRE, 10Performance-Team, 10Traffic, 10serviceops: Potential navtiming_responseStart regression as of 13 Mar 2022 - https://phabricator.wikimedia.org/T303782 (10Peter) I'll just check Chrome vs Safari on mobile. When 100 rolled out I saw this https://phabricator.wikimedia.org/T305122#7838322 on WebPageTes... 
[18:20:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149', diff saved to https://phabricator.wikimedia.org/P25429 and previous config saved to /var/cache/conftool/dbconfig/20220419-182031-ladsgroup.json [18:20:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:54] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10Bethany) @Volans Approved. June 30 is the contract end date. Thanks! [18:23:56] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10jmads) > in the meanwhile you can (re?)read https://wikitech.wikimedia.org/wiki/Analytics/Data_access#User_responsibilities re-read! [18:25:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25430 and previous config saved to /var/cache/conftool/dbconfig/20220419-182552-ladsgroup.json [18:25:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:49] !log jhuneidi@deploy1002 Started scap: testwikis wikis to 1.39.0-wmf.8 refs T305214 [18:27:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:53] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [18:31:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudservices2004-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:19] (03PS2) 10JHathaway: smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) [18:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - 
https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:34:05] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudservices2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:34:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:16] (03CR) 10JHathaway: smart_data_dump: skip over iDRAC devices (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [18:34:25] (03PS1) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:35:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1149 (T298565)', diff saved to https://phabricator.wikimedia.org/P25431 and previous config saved to /var/cache/conftool/dbconfig/20220419-183536-ladsgroup.json [18:35:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:35:39] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [18:35:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:41] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25432 and previous config saved to /var/cache/conftool/dbconfig/20220419-183544-ladsgroup.json [18:35:46] Logged 
the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:35:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:00] (03PS2) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:39:37] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [18:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:40] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [18:39:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [18:39:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:45] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [18:39:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:39:48] (03PS3) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:40:50] (03PS4) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:40:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179', diff saved to https://phabricator.wikimedia.org/P25433 and previous config saved to /var/cache/conftool/dbconfig/20220419-184057-ladsgroup.json [18:41:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:41:15] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 9.497 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:41:39] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.089 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:42:27] (03PS5) 10CDanis: WIP [puppet] - 
10https://gerrit.wikimedia.org/r/784309 [18:45:06] (03PS6) 10CDanis: WIP [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:45:41] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2365 bytes in 1.653 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:47:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25434 and previous config saved to /var/cache/conftool/dbconfig/20220419-184745-ladsgroup.json [18:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:51] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:48:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25435 and previous config saved to /var/cache/conftool/dbconfig/20220419-184801-ladsgroup.json [18:48:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:09] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [18:48:37] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:49:53] 10SRE, 10MediaWiki-General, 10Performance-Team, 10serviceops-radar, and 5 others: Move MainStash out of Redis to a simpler multi-dc aware solution - 
https://phabricator.wikimedia.org/T212129 (10Krinkle) Next: Decide on how and whether to fragment the data in mainstashdb, e.g. like parser cache, like extern... [18:50:53] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 7.415 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:53:18] (03PS7) 10CDanis: Proof of concept for haproxy statistics tracking [puppet] - 10https://gerrit.wikimedia.org/r/784309 [18:53:54] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudservices2005-dev.mgmt.codfw.wmnet with reboot policy FORCED [18:53:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:54:11] (03CR) 10CDanis: "PCC looks good: https://puppet-compiler.wmflabs.org/pcc-worker1001/34900/" [puppet] - 10https://gerrit.wikimedia.org/r/784309 (owner: 10CDanis) [18:55:49] 10ops-eqiad: elastic1097 Failed DIMM slot A2 - https://phabricator.wikimedia.org/T306462 (10Cmjohnson) [18:56:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1179 (T298565)', diff saved to https://phabricator.wikimedia.org/P25436 and previous config saved to /var/cache/conftool/dbconfig/20220419-185602-ladsgroup.json [18:56:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2105.codfw.wmnet with reason: Maintenance [18:56:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [18:56:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
db2105.codfw.wmnet with reason: Maintenance [18:56:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 6 hosts with reason: Maintenance [18:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:14] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 6 hosts with reason: Maintenance [18:56:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:57:51] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [18:59:25] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.013 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:00:31] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: Upgrading Elasticsearch to 6.8 in EQIAD - bking@cumin1001 - T301959 [19:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:36] T301959: Upgrade Search elasticsearch cluster / eqiad to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301959 [19:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale 
[19:02:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25437 and previous config saved to /var/cache/conftool/dbconfig/20220419-190250-ladsgroup.json [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P25438 and previous config saved to /var/cache/conftool/dbconfig/20220419-190306-ladsgroup.json [19:03:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:28] !log bking@cumin1001 END (ERROR) - Cookbook sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: Upgrading Elasticsearch to 6.8 in EQIAD - bking@cumin1001 - T301959 [19:09:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:36] T301959: Upgrade Search elasticsearch cluster / eqiad to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301959 [19:10:05] !log jhuneidi@deploy1002 Finished scap: testwikis wikis to 1.39.0-wmf.8 refs T305214 (duration: 42m 16s) [19:10:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:09] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [19:14:24] (03PS1) 10Bking: elastic: increase recovery time [cookbooks] - 10https://gerrit.wikimedia.org/r/784310 (https://phabricator.wikimedia.org/T305994) [19:14:44] !log pt1979@cumin2002 START - Cookbook sre.dns.netbox [19:14:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:48] (03PS2) 10Ryan Kemper: elastic: increase recovery time [cookbooks] - 10https://gerrit.wikimedia.org/r/784310 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking) [19:14:50] (03PS1) 10Ladsgroup: filebackend: Fix link to thumb url in 
testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) [19:14:53] (03CR) 10Gehel: [C: 03+2] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/784310 (https://phabricator.wikimedia.org/T305994) (owner: 10Bking) [19:15:37] !log jhuneidi@deploy1002 Pruned MediaWiki: 1.39.0-wmf.6 (duration: 01m 31s) [19:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:17:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161', diff saved to https://phabricator.wikimedia.org/P25439 and previous config saved to /var/cache/conftool/dbconfig/20220419-191756-ladsgroup.json [19:17:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:18:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314', diff saved to https://phabricator.wikimedia.org/P25440 and previous config saved to /var/cache/conftool/dbconfig/20220419-191812-ladsgroup.json [19:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:19:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:19:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [19:20:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:03] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [19:20:04] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [19:20:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [19:20:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:11] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:23] !log bking@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: Upgrading Elasticsearch to 6.8 in EQIAD - bking@cumin1001 - T301959 [19:20:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:26] T301959: Upgrade Search elasticsearch cluster / eqiad to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301959 [19:20:53] !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host cloudweb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:20:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:21:36] (03CR) 10Cwhite: [C: 03+2] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/784294 (https://phabricator.wikimedia.org/T304583) (owner: 10Phedenskog) [19:23:19] (03CR) 10Krinkle: [C: 03+1] "Since this feature hasn't been used in prod before, I enabled it locally to confirm that it still works as intended. I created [[Foo]] -> " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/780636 (https://phabricator.wikimedia.org/T305782) (owner: 10MarcoAurelio) [19:23:24] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Andrew) [19:25:38] (03CR) 10Majavah: "what about wgUploadPath in IS.php?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [19:26:23] (03CR) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [19:28:03] (03CR) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [19:28:36] (03CR) 10Cwhite: "Thanks, all!" [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [19:28:45] (03CR) 10Cwhite: [C: 03+1] smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [19:29:19] (03CR) 10Majavah: filebackend: Fix link to thumb url in testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [19:31:11] (03PS2) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) [19:31:28] (03CR) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) (owner: 10Ladsgroup) [19:31:32] (03PS3) 10Ladsgroup: filebackend: Fix link to thumb url in testcommonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784311 (https://phabricator.wikimedia.org/T306139) [19:31:41] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2365 bytes in 1.867 second response 
time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:33:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1161 (T298565)', diff saved to https://phabricator.wikimedia.org/P25441 and previous config saved to /var/cache/conftool/dbconfig/20220419-193301-ladsgroup.json [19:33:03] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:33:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [19:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:07] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:33:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25442 and previous config saved to /var/cache/conftool/dbconfig/20220419-193309-ladsgroup.json [19:33:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3314 (T298565)', diff saved to https://phabricator.wikimedia.org/P25443 and previous config saved to /var/cache/conftool/dbconfig/20220419-193318-ladsgroup.json [19:33:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:33:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on 
dbstore1007.eqiad.wmnet with reason: Maintenance [19:33:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:34:55] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=viwiki --delete # T304461 [19:34:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:00] T304461: Delete `growthexperiments-mentor-id` properties from user_properties - https://phabricator.wikimedia.org/T304461 [19:35:50] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/GrowthExperiments/maintenance/T304461.php --wiki=frwiki --delete # T304461 [19:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:36:22] (03PS7) 10Juan90264: Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) [19:38:43] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [19:39:46] okay, the few wikis completed just fine, running it everywhere [19:39:59] !log [urbanecm@mwmaint1002 ~]$ foreachwikiindblist growthexperiments extensions/GrowthExperiments/maintenance/T304461.php --delete # T304461 [19:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:03] T304461: Delete `growthexperiments-mentor-id` properties from user_properties - https://phabricator.wikimedia.org/T304461 [19:40:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25444 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-194008-ladsgroup.json [19:40:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:40:12] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [19:42:08] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:42:10] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1145.eqiad.wmnet with reason: Maintenance [19:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:31] (03PS3) 10JHathaway: smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) [19:46:58] 10SRE: Allow Wikimedia Maps usage on a private project for an university. 
- https://phabricator.wikimedia.org/T306467 (10HermidaVazquez) [19:47:57] (03CR) 10JHathaway: [C: 03+2] smart_data_dump: skip over iDRAC devices [puppet] - 10https://gerrit.wikimedia.org/r/780990 (https://phabricator.wikimedia.org/T294564) (owner: 10JHathaway) [19:49:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:49:23] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:17] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host cloudweb2002-dev.mgmt.codfw.wmnet with reboot policy FORCED [19:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:44] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [19:50:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1138.eqiad.wmnet with reason: Maintenance [19:50:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25445 and previous config saved to /var/cache/conftool/dbconfig/20220419-195050-ladsgroup.json [19:50:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:54] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf 
wikis - https://phabricator.wikimedia.org/T298565 [19:52:01] PROBLEM - SSH on wtp1038.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [19:55:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25446 and previous config saved to /var/cache/conftool/dbconfig/20220419-195513-ladsgroup.json [19:55:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:55:26] 10SRE, 10ops-codfw, 10DC-Ops, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10Papaul) [19:59:43] (03PS6) 10Cwhite: logstash: populate target index format and add pipeline diagnostics [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) [19:59:45] (03PS5) 10Cwhite: logstash: set partition on legacy indexes [puppet] - 10https://gerrit.wikimedia.org/r/777880 (https://phabricator.wikimedia.org/T305175) [19:59:47] (03PS4) 10Cwhite: logstash: transform rotation frequency values to datestamp format [puppet] - 10https://gerrit.wikimedia.org/r/777882 (https://phabricator.wikimedia.org/T305175) [19:59:49] (03PS4) 10Cwhite: logstash: rewrite ecs settings [puppet] - 10https://gerrit.wikimedia.org/r/777887 (https://phabricator.wikimedia.org/T305013) [20:00:04] RoanKattouw and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20220419T2000). [20:00:04] tgr and musikanimal: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [20:00:15] here! 
[20:00:21] (03PS9) 10Ebernhardson: elastic: Restart masters one at a time after all others [software/spicerack] - 10https://gerrit.wikimedia.org/r/781009 (https://phabricator.wikimedia.org/T306389) [20:00:26] hello! [20:00:27] i can deploy today! [20:00:44] tgr_: hi, are you around? [20:01:02] (03CR) 10Urbanecm: [C: 03+2] DeletePage, UndeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) (owner: 10MusikAnimal) [20:02:16] musikanimal: I'll ping you once it merges (and can be tested). [20:03:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25447 and previous config saved to /var/cache/conftool/dbconfig/20220419-200303-ladsgroup.json [20:03:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:09] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:03:49] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2365 bytes in 1.579 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:04:48] o/ [20:05:06] hey tgr! [20:05:43] tgr: is it intentional that the B variant is missing from the wgGECampaignPattern? [20:06:11] it's the control group so not expected to do anything [20:06:28] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: 2M 25G DAC testing - https://phabricator.wikimedia.org/T306220 (10Jclark-ctr) 05Open→03Resolved a:05cmooney→03Jclark-ctr Thanks Removed both cables closing task [20:06:37] okay. 
wasn't sure if we need to do anything for it to get recorded, etc. [20:06:47] (03PS2) 10Urbanecm: Add video marketing campaign to $wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783449 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:06:51] (03CR) 10Urbanecm: [C: 03+2] Add video marketing campaign to $wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783449 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:07:00] instrumentation happens via the Campaigns extension AFAIK [20:07:24] okay, great [20:07:44] (03Merged) 10jenkins-bot: Add video marketing campaign to $wgGECampaignPattern [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783449 (https://phabricator.wikimedia.org/T303785) (owner: 10Gergő Tisza) [20:07:58] I'll have one more patch (a backport) in a sec. I can deploy it if preferred. [20:08:11] (03CR) 10Cwhite: logstash: populate target index format and add pipeline diagnostics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/775375 (https://phabricator.wikimedia.org/T305090) (owner: 10Cwhite) [20:08:11] tgr: config's at mwdebug if you want to have a look [20:08:38] tgr: ad backport deployment, up to you. I'll be deploying musik's core backport, so i can do yours as well, or just let you know. 
[20:09:05] not really testable until the train reaches eswiki, I think [20:09:22] ah, okay [20:09:25] well in that case, syncing [20:09:45] 10SRE, 10ops-eqiad, 10DC-Ops, 10Infrastructure-Foundations, 10netops: Q2:(Need By: TBD) replace mr1-eqiad - https://phabricator.wikimedia.org/T294474 (10Jclark-ctr) cable has been removed pinged on irc [20:10:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25448 and previous config saved to /var/cache/conftool/dbconfig/20220419-201018-ladsgroup.json [20:10:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:36] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:10:37] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:41] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:43] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: f55f817: Add video marketing campaign to $wgGECampaignPattern (T303785) (duration: 00m 54s) [20:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:45] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.009 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:10:48] and, live [20:10:49] Logged the message at
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:10:50] T303785: Account creation: social media landing pages - https://phabricator.wikimedia.org/T303785 [20:11:08] (03PS1) 10Gergő Tisza: Revert "Skip welcome surveys for users in the no-homepage control group" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783916 (https://phabricator.wikimedia.org/T305015) [20:11:08] * urbanecm waiting on CI for r783911 [20:11:20] (03PS3) 10Raymond Ndibe: Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) [20:12:20] tgr: is r783916 the backport to deploy? [20:12:46] yeah. Could you also deploy that? [20:12:51] (03CR) 10jerkins-bot: [V: 04-1] Create REST api service to manage toolforge replica.my.cnf [puppet] - 10https://gerrit.wikimedia.org/r/777037 (https://phabricator.wikimedia.org/T304040) (owner: 10Raymond Ndibe) [20:12:59] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 8.136 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:13:01] sure thing [20:13:05] (03CR) 10Urbanecm: [C: 03+2] Revert "Skip welcome surveys for users in the no-homepage control group" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783916 (https://phabricator.wikimedia.org/T305015) (owner: 10Gergő Tisza) [20:13:07] thanks! 
[20:13:11] np [20:13:39] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:14:23] PROBLEM - SSH on pki2001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:14:27] in theory that's not easily deployable (changes a common hook class ctor signature), but given wmf.8 is only testwiki, it should be fine [20:14:33] (03CR) 10Urbanecm: [C: 03+2] Revert "Skip welcome surveys for users in the no-homepage control group" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783916 (https://phabricator.wikimedia.org/T305015) (owner: 10Gergő Tisza) [20:15:38] Hello [20:15:43] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:15:46] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:47] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:51] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:15:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:58] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes1018:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [20:17:15] hello Juan_90264 [20:17:26] don't forget me [20:18:08] !log ladsgroup@cumin1001 
dbctl commit (dc=all): 'Repooling after maintenance db1138', diff saved to https://phabricator.wikimedia.org/P25449 and previous config saved to /var/cache/conftool/dbconfig/20220419-201808-ladsgroup.json [20:18:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:18:34] Please F5 in Deployments of WikiTech [20:18:53] Juan_90264: there was no patch when the window started, so, thanks for the ping. [20:19:05] Okay [20:19:12] (03PS8) 10Urbanecm: Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) (owner: 10Juan90264) [20:19:17] (03CR) 10Urbanecm: [C: 03+2] Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) (owner: 10Juan90264) [20:19:55] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:21] (03Merged) 10jenkins-bot: Add extendedconfirmed on elwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/783910 (https://phabricator.wikimedia.org/T306241) (owner: 10Juan90264) [20:20:34] (03Merged) 10jenkins-bot: DeletePage, UndeletePage: use plaintextParams when creating log message [core] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783911 (https://phabricator.wikimedia.org/T306431) (owner: 10MusikAnimal) [20:20:45] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:20:46] 1001 or 1002 [20:20:49] Juan_90264: can you check your config please? [20:20:50] ? 
[20:20:51] at mwdebug1001 [20:21:05] Yes, I will check [20:21:22] musikanimal: your patch is at mwdebug1001, can you check please? [20:21:29] sure thing [20:22:19] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 9.583 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:22:41] urbanecm: I checked and approved [20:22:54] Juan_90264: syncing [20:23:05] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:23:42] urbanecm: looks good! [20:23:48] musikanimal: thanks, will sync too! [20:24:31] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 0a877710be56a06721d128868fd991b74e1f54a9: Add extendedconfirmed on elwiki (T306241) (duration: 00m 50s) [20:24:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:36] T306241: Set extendedconfirmed protection level/user group for elwiki - https://phabricator.wikimedia.org/T306241 [20:25:23] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25450 and previous config saved to /var/cache/conftool/dbconfig/20220419-202523-ladsgroup.json [20:25:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:28] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:25:41] Thanks Urbanecm for deploying! 
[20:25:57] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:00] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:26:01] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:26:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:05] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:26:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:07] np [20:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:12] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:26:13] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1096.eqiad.wmnet with reason: Maintenance [20:26:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25451 and previous config saved to /var/cache/conftool/dbconfig/20220419-202618-ladsgroup.json [20:26:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:55] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/includes/page/DeletePage.php: f1ebd29: DeletePage, UndeletePage: use plaintextParams when creating log message (T306431; 1/2) (duration: 00m 50s) [20:26:55] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [20:26:58] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:26:59] T306431: Templates get transcluded in (un)delete reason for associated talk page - https://phabricator.wikimedia.org/T306431 [20:27:45] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/includes/page/UndeletePage.php: f1ebd29: DeletePage, UndeletePage: use plaintextParams when creating log message (T306431; 2/2) (duration: 00m 50s) [20:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:28:00] musikanimal: should be live. anything else? [20:28:30] that is all. https://test.wikipedia.org/wiki/Special:Version still says it's at a4aaf3c75 though [20:28:35] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.300 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:29:05] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:29:06] though I just tested it again, the commit is in there. So I guess we're good :) [20:29:36] musikanimal: yeah, i don't think the git cache updates on backports :) [20:29:46] okay, I figured it was something like that. Thanks! [20:29:47] as long as it works, we should be fine :) [20:29:53] yup, it does! [20:29:59] great! 
[20:31:12] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:31:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:31:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:47] PROBLEM - BGP status on cr2-eqiad is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [20:33:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25452 and previous config saved to /var/cache/conftool/dbconfig/20220419-203301-ladsgroup.json [20:33:05] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.007 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:33:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:33:14] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138', diff 
saved to https://phabricator.wikimedia.org/P25453 and previous config saved to /var/cache/conftool/dbconfig/20220419-203313-ladsgroup.json [20:33:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [20:34:11] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1166.eqiad.wmnet with reason: Maintenance [20:34:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25454 and previous config saved to /var/cache/conftool/dbconfig/20220419-203416-ladsgroup.json [20:34:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:21] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:35:23] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 3.256 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:40:24] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Jim Maddock - https://phabricator.wikimedia.org/T249873 (10KFrancis) Hi all, as Jim Maddock is a contractor with the WMF, I am confirming the NDA. Please proceed with the access request. 
[20:42:23] (03Merged) 10jenkins-bot: Revert "Skip welcome surveys for users in the no-homepage control group" [extensions/GrowthExperiments] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783916 (https://phabricator.wikimedia.org/T305015) (owner: 10Gergő Tisza) [20:43:48] finally [20:44:39] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:44:42] tgr: pulled to mwdebug1001 if you want to have a look [20:45:28] (03PS1) 10Zabe: admin: Update email address for Zabe [puppet] - 10https://gerrit.wikimedia.org/r/784320 [20:46:15] looking [20:46:17] (03PS24) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [20:46:30] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [20:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [20:46:35] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [20:46:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [20:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:46:53] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [20:46:55] RECOVERY - Device not healthy -SMART- on aqs1007 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/d/000000377/host-overview?var-server=aqs1007&var-datasource=eqiad+prometheus/ops [20:47:24] urbanecm: looks good, thanks! [20:47:28] thanks, syncing [20:47:37] (03PS25) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [20:48:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25455 and previous config saved to /var/cache/conftool/dbconfig/20220419-204806-ladsgroup.json [20:48:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:12] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [20:48:19] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1138 (T298565)', diff saved to https://phabricator.wikimedia.org/P25456 and previous config saved to /var/cache/conftool/dbconfig/20220419-204818-ladsgroup.json [20:48:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:48:22] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1147.eqiad.wmnet with reason: Maintenance [20:48:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [20:48:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25457 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-204826-ladsgroup.json [20:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:00] !log urbanecm@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/GrowthExperiments/: e152df0: Revert "Skip welcome surveys for users in the no-homepage control group" (T305015) (duration: 00m 55s) [20:49:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:04] T305015: Welcome emails: reserve control group - https://phabricator.wikimedia.org/T305015 [20:49:06] tgr: should be live [20:49:09] anything else, anyone? [20:51:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25458 and previous config saved to /var/cache/conftool/dbconfig/20220419-205143-ladsgroup.json [20:51:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:15] (03PS26) 10Vivian Rook: pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 [20:52:28] i guess not [20:52:32] !log UTC late B&C window done [20:52:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:52:40] PROBLEM - SSH on wtp1045.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [20:52:48] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [20:52:54] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 3.793 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [20:57:08] (03PS27) 10Vivian Rook: pcc commit do not merge [puppet] - 
10https://gerrit.wikimedia.org/r/782107 [20:57:43] (03CR) 10jerkins-bot: [V: 04-1] pcc commit do not merge [puppet] - 10https://gerrit.wikimedia.org/r/782107 (owner: 10Vivian Rook) [20:57:54] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.320 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:01:48] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:02:44] (KubernetesRsyslogDown) firing: (4) rsyslog on ml-staging-ctrl2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [21:03:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315', diff saved to https://phabricator.wikimedia.org/P25459 and previous config saved to /var/cache/conftool/dbconfig/20220419-210311-ladsgroup.json [21:03:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:11] (03PS1) 10Zabe: memcached: migrate memkeys cron to systemd timer job [puppet] - 10https://gerrit.wikimedia.org/r/784323 (https://phabricator.wikimedia.org/T273673) [21:05:13] (03PS1) 10Zabe: memcached: remove absented memkeys cron [puppet] - 10https://gerrit.wikimedia.org/r/784324 (https://phabricator.wikimedia.org/T273673) [21:06:20] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:06:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after 
maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25460 and previous config saved to /var/cache/conftool/dbconfig/20220419-210648-ladsgroup.json [21:06:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:07:54] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 9.995 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:08:14] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.864 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:12:54] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 263 bytes in 1.006 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:13:30] PROBLEM - Maps HTTPS on maps1006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [21:15:08] RECOVERY - SSH on pki2001.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:15:56] PROBLEM - SSH on mw2258.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:16:40] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 4.467 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:18:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1096:3315 (T298565)', diff saved to 
https://phabricator.wikimedia.org/P25462 and previous config saved to /var/cache/conftool/dbconfig/20220419-211817-ladsgroup.json [21:18:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:18:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1144.eqiad.wmnet with reason: Maintenance [21:18:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:23] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:18:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25463 and previous config saved to /var/cache/conftool/dbconfig/20220419-211824-ladsgroup.json [21:18:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:21:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P25464 and previous config saved to /var/cache/conftool/dbconfig/20220419-212153-ladsgroup.json [21:21:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:23:02] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:23:28] PROBLEM - BGP status on cr1-eqiad is CRITICAL: BGP CRITICAL - 
AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:25:10] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 2.515 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:25:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25465 and previous config saved to /var/cache/conftool/dbconfig/20220419-212514-ladsgroup.json [21:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:20] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:25:31] !log set index.unassigned.node_left.delayed_timeout to 10m for all indices in elasticsearch psi (:9200) cluster [21:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:29:28] (03PS1) 10Jforrester: Hooks: return false rather than strings on failure [extensions/LdapAuthentication] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783917 (https://phabricator.wikimedia.org/T305786) [21:32:10] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:34:22] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 7.373 
second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:35:16] Going to do a backport for https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LdapAuthentication/+/783917/ [21:35:28] (03CR) 10Jeena Huneidi: [C: 03+2] Hooks: return false rather than strings on failure [extensions/LdapAuthentication] (wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783917 (https://phabricator.wikimedia.org/T305786) (owner: 10Jforrester) [21:37:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T298565)', diff saved to https://phabricator.wikimedia.org/P25466 and previous config saved to /var/cache/conftool/dbconfig/20220419-213658-ladsgroup.json [21:37:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:37:03] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1175.eqiad.wmnet with reason: Maintenance [21:37:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:06] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:37:08] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25467 and previous config saved to /var/cache/conftool/dbconfig/20220419-213707-ladsgroup.json [21:37:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:33] (03Merged) 10jenkins-bot: Hooks: return false rather than strings on failure [extensions/LdapAuthentication] 
(wmf/1.39.0-wmf.8) - 10https://gerrit.wikimedia.org/r/783917 (https://phabricator.wikimedia.org/T305786) (owner: 10Jforrester) [21:40:20] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25468 and previous config saved to /var/cache/conftool/dbconfig/20220419-214019-ladsgroup.json [21:40:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:15] !log jhuneidi@deploy1002 Synchronized php-1.39.0-wmf.8/extensions/LdapAuthentication/includes/LdapAuthenticationHooks.php: Backport: [[gerrit:783917|Hooks: return false rather than strings on failure (T305786)]] (duration: 01m 30s) [21:41:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:19] T305786: Fatal exception of type "UnexpectedValueException" when attempting to block - https://phabricator.wikimedia.org/T305786 [21:42:11] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [21:42:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:15] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [21:42:16] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [21:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [21:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:55] (NodeTextfileStale) firing: Stale textfile for cloudcontrol2001-dev:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:48:06] PROBLEM - LVS 
tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:48:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25469 and previous config saved to /var/cache/conftool/dbconfig/20220419-214841-ladsgroup.json [21:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:47] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [21:53:32] RECOVERY - SSH on wtp1038.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:53:54] RECOVERY - SSH on wtp1045.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [21:54:52] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2366 bytes in 3.496 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:55:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1144:3315', diff saved to https://phabricator.wikimedia.org/P25470 and previous config saved to /var/cache/conftool/dbconfig/20220419-215525-ladsgroup.json [21:55:28] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 4.256 
second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [21:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:14] !log set indices.recovery.max_bytes_per_sec=240mb in elasticsearch-eqiad-psi [21:58:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:50] PROBLEM - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to https://phabricator.wikimedia.org/P25471 and previous config saved to /var/cache/conftool/dbconfig/20220419-220346-ladsgroup.json [22:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:57] (03PS1) 10Papaul: Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/784331 (https://phabricator.wikimedia.org/T304881) [22:05:40] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [22:07:00] PROBLEM - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:09:32] (03CR) 10Papaul: [C: 03+2] Add new cloud nodes to site.pp and netboot.cfg [puppet] - 10https://gerrit.wikimedia.org/r/784331 (https://phabricator.wikimedia.org/T304881) (owner: 10Papaul) [22:10:30] !log ladsgroup@cumin1001 dbctl 
commit (dc=all): 'Repooling after maintenance db1144:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25472 and previous config saved to /var/cache/conftool/dbconfig/20220419-221030-ladsgroup.json [22:10:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:10:33] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1113.eqiad.wmnet with reason: Maintenance [22:10:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:35] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25473 and previous config saved to /var/cache/conftool/dbconfig/20220419-221038-ladsgroup.json [22:10:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:46] RECOVERY - LVS tegola-vector-tiles eqiad port 4105/tcp - Tegola Vector Tiles- tegola-vector-tiles.svc.eqiad.wmnet IPv4 on tegola-vector-tiles.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 2365 bytes in 1.416 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:13:50] All train blockers have been resolved so I will roll to group 0 shortly [22:14:15] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [22:14:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:24] 
10SRE, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcephmon2005-dev.codfw... [22:16:33] (03PS1) 10Jeena Huneidi: group0 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784333 [22:16:36] (03CR) 10Jeena Huneidi: [C: 03+2] group0 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784333 (owner: 10Jeena Huneidi) [22:17:00] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 3.322 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:17:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25474 and previous config saved to /var/cache/conftool/dbconfig/20220419-221701-ladsgroup.json [22:17:02] RECOVERY - SSH on mw2258.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [22:17:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:08] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:17:18] (03Merged) 10jenkins-bot: group0 wikis to 1.39.0-wmf.8 refs T305214 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/784333 (owner: 10Jeena Huneidi) [22:18:52] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147', diff saved to 
https://phabricator.wikimedia.org/P25475 and previous config saved to /var/cache/conftool/dbconfig/20220419-221851-ladsgroup.json [22:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:18:58] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.39.0-wmf.8 refs T305214 [22:19:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:03] T305214: 1.39.0-wmf.8 deployment blockers - https://phabricator.wikimedia.org/T305214 [22:20:49] (03CR) 10Andrew Bogott: "I'm adding some absenting logic to this. I don't disagree with your thought about running it off of the haproxy server but the logic to ha" [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:21:31] (03PS4) 10Andrew Bogott: prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:22:04] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:22:44] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:47] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:22:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:22:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:22:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:55] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:42] PROBLEM - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:25:40] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [22:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25476 and previous config saved to /var/cache/conftool/dbconfig/20220419-223206-ladsgroup.json [22:32:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:32:55] (NodeTextfileStale) firing: (3) Stale textfile for elastic1075:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:33:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1147 (T298565)', diff saved to https://phabricator.wikimedia.org/P25477 and previous config saved to /var/cache/conftool/dbconfig/20220419-223356-ladsgroup.json [22:34:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:34:02] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [22:36:06] (03PS5) 10Andrew Bogott: 
prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:36:16] RECOVERY - Maps HTTPS on maps1006 is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 6.819 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [22:36:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [22:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25478 and previous config saved to /var/cache/conftool/dbconfig/20220419-223722-ladsgroup.json [22:37:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:37:35] (03CR) 10jerkins-bot: [V: 04-1] prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:38:24] RECOVERY - LVS kartotherian-ssl eqiad port 443/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet - HTTPS IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 5.102 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:39:25] (03PS6) 10Andrew Bogott: prometheus-openstack-exporter: only run it on the primary server [puppet] - 10https://gerrit.wikimedia.org/r/779516 (https://phabricator.wikimedia.org/T302178) (owner: 10Arturo Borrero Gonzalez) [22:40:10] RECOVERY - LVS kartotherian eqiad port 6533/tcp - Kartotherian- kartotherian.svc.eqiad.wmnet IPv4 on kartotherian.svc.eqiad.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1287 bytes in 0.670 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [22:40:15] !log pt1979@cumin2002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2005-dev.codfw.wmnet with reason: host reimage [22:40:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:41] (CR) Andrew Bogott: [C: +2] hieradata: pcc: add project-proxy puppetmaster key [puppet] - https://gerrit.wikimedia.org/r/781956 (owner: Majavah) [22:42:37] !log bking@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: Upgrading Elasticsearch to 6.8 in EQIAD - bking@cumin1001 - T301959 [22:42:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:41] T301959: Upgrade Search elasticsearch cluster / eqiad to elasticsearch 6.8.23 - https://phabricator.wikimedia.org/T301959 [22:42:54] (CR) Andrew Bogott: [C: +2] hieradata: update openstack to use ldap-rw hostnames [puppet] - https://gerrit.wikimedia.org/r/784086 (https://phabricator.wikimedia.org/T295150) (owner: Majavah) [22:47:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315', diff saved to https://phabricator.wikimedia.org/P25479 and previous config saved to /var/cache/conftool/dbconfig/20220419-224711-ladsgroup.json [22:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:50:56] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye [22:50:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:02] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcephmon2005-dev.codfw.wmnet with OS bullseye co... 
[22:51:15] (CR) Andrew Bogott: [C: +2] openstack: cleanup enc api remains from puppetmasters [puppet] - https://gerrit.wikimedia.org/r/781962 (https://phabricator.wikimedia.org/T295247) (owner: Majavah) [22:52:28] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25480 and previous config saved to /var/cache/conftool/dbconfig/20220419-225227-ladsgroup.json [22:52:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:05] !log mwdebug-deploy@deploy1002 helmfile [eqiad] START helmfile.d/services/mwdebug: apply [22:53:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:08] !log mwdebug-deploy@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mwdebug: apply [22:53:09] !log mwdebug-deploy@deploy1002 helmfile [codfw] START helmfile.d/services/mwdebug: apply [22:53:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:13] !log mwdebug-deploy@deploy1002 helmfile [codfw] DONE helmfile.d/services/mwdebug: apply [22:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye [22:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:39] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye [22:54:54] (CR) Andrew Bogott: [C: +2] P:openldap: remove 'labs' branding [puppet] - https://gerrit.wikimedia.org/r/776191 (https://phabricator.wikimedia.org/T295150) (owner: 
Majavah) [22:56:54] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2005-dev.codfw.wmnet with OS bullseye [22:56:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:57:00] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudnet2005-dev.codfw.wmnet with OS bullseye [22:58:13] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) [23:01:55] (NodeTextfileStale) firing: Stale textfile for ms-be2067:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:02:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1113:3315 (T298565)', diff saved to https://phabricator.wikimedia.org/P25481 and previous config saved to /var/cache/conftool/dbconfig/20220419-230218-ladsgroup.json [23:02:20] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [23:02:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1130.eqiad.wmnet with reason: Maintenance [23:02:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:24] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:02:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 
'Depooling db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25482 and previous config saved to /var/cache/conftool/dbconfig/20220419-230226-ladsgroup.json [23:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:05:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1130 (T298565)', diff saved to https://phabricator.wikimedia.org/P25483 and previous config saved to /var/cache/conftool/dbconfig/20220419-230459-ladsgroup.json [23:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:07:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P25484 and previous config saved to /var/cache/conftool/dbconfig/20220419-230732-ladsgroup.json [23:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:09:32] (CR) Andrew Bogott: [C: +2] wikitech: migrate mw-xml cron to systemd timer job [puppet] - https://gerrit.wikimedia.org/r/781053 (https://phabricator.wikimedia.org/T273673) (owner: Zabe) [23:12:47] (PS1) Arlolra: Commit changes from update --no-dev before bumping parsoid [vendor] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/784340 [23:12:49] (PS1) Arlolra: Bump parsoid to 0.16.0-a6 [vendor] (wmf/1.39.0-wmf.8) - https://gerrit.wikimedia.org/r/784341 (https://phabricator.wikimedia.org/T305641) [23:15:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet with reason: host reimage [23:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:57] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon2006-dev.codfw.wmnet 
with reason: host reimage [23:19:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:20:23] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [23:20:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T298565)', diff saved to https://phabricator.wikimedia.org/P25485 and previous config saved to /var/cache/conftool/dbconfig/20220419-232237-ladsgroup.json [23:22:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:22:41] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1112.eqiad.wmnet with reason: Maintenance [23:22:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:42] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:22:43] T298565: Fix mismatching field type of user table for columns user_email_authenticated, user_email_token, user_email_token_expires, user_newpass_time, user_registration, user_token, user_touched, user_newpassword, user_password, user_email on wmf wikis - https://phabricator.wikimedia.org/T298565 [23:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:46] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [23:22:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1112 (T298565)', diff saved to https://phabricator.wikimedia.org/P25486 and previous config saved to 
/var/cache/conftool/dbconfig/20220419-232250-ladsgroup.json [23:22:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:22:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:42] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudnet2005-dev.codfw.wmnet with reason: host reimage [23:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:27:28] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [23:28:53] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye [23:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:29:00] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudcephmon2006-dev.codfw.wmnet with OS bullseye co... 
[23:30:47] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudservices2004-dev.wikimedia.org with OS bullseye [23:30:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:30:52] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudservices2004-dev.wikimedia.org with OS bull... [23:32:09] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) [23:34:28] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudnet2005-dev.codfw.wmnet with OS bullseye [23:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:34] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host cloudnet2005-dev.codfw.wmnet with OS bullseye comple... 
[23:34:48] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host cloudnet2006-dev.codfw.wmnet with OS bullseye [23:34:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:54] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host cloudnet2006-dev.codfw.wmnet with OS bullseye [23:45:55] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) [23:46:47] SRE, ops-codfw, DC-Ops, cloud-services-team (Hardware): Q3:(Need By: TBD) rack/setup/install 7 wmcs hosts - https://phabricator.wikimedia.org/T304881 (Papaul) second interface added for cloudnet2005 ` [edit interfaces] + ge-1/0/24 { + description cloudnet2005-dev; + unit 0 { +... 
[23:49:42] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudservices2004-dev.wikimedia.org with reason: host reimage [23:49:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:04] PROBLEM - Number of backend failures per minute from CirrusSearch on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [600.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cluster=eqiad&var-smoothing=1&viewPanel=9 [23:50:15] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:50:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1003.eqiad.wmnet with reason: Maintenance [23:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:50:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:20] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudservices2004-dev.wikimedia.org with reason: host reimage [23:54:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [23:54:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2123.codfw.wmnet with reason: Maintenance [23:54:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:49] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on 8 hosts with reason: Maintenance [23:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:55] !log 
ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on 8 hosts with reason: Maintenance [23:54:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [23:56:42] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [23:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:56:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudnet2006-dev.codfw.wmnet with reason: host reimage [23:56:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log