[01:08:52] PROBLEM - MariaDB Replica Lag: pc1 on pc2021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 301.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [01:12:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306013 [01:12:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306013 (owner: 10TrainBranchBot) [01:20:39] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1306013 (owner: 10TrainBranchBot) [02:00:35] !log mwpresync@deploy1003 Started scap build-images: Publishing wmf/next image [02:07:38] !log mwpresync@deploy1003 Finished scap build-images: Publishing wmf/next image (duration: 07m 03s) [02:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:10:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [02:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [02:27:12] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2003 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:46:52] RECOVERY - MariaDB Replica Lag: pc1 on pc2021 is OK: OK slave_sql_lag Replication lag: 58.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/Troubleshooting%23Incident_Response [03:27:12] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext-next_hourly on cumin2003 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext-next_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [03:47:53] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12061493 (10Fuyo21) In zhwiki, some Chinese characters will break the timeline, eg, "六"“八”. [04:04:41] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:09:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:39:41] FIRING: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [04:43:04] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2010 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:44:02] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2010 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:44:04] PROBLEM - Blazegraph Port for wdqs-blazegraph on wdqs2014 is CRITICAL: connect to address 127.0.0.1 and port 9999: Connection refused https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:45:04] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2014 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [04:49:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:41:13] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [05:41:43] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [05:41:44] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [05:42:15] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [05:46:08] 06SRE-OnFire, 10Citoid: Large increase in citoid latency starting on June 25/ ~ 21 UTC - present - https://phabricator.wikimedia.org/T430279#12061588 (10Mvolz) p:05High→03Medium [05:59:49] RESOLVED: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:10:49] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [06:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [06:32:51] 10SRE-swift-storage, 10EasyTimeline: "Timeline error. Could not store output files" - https://phabricator.wikimedia.org/T428063#12061630 (10lado85) >>! In T428063#12059463, @WindEwriX wrote: > At least for some cases in ruwiki timeline is broken by small cyrillic letter х, but if it is replaced by mnemonic ... [06:44:49] !log cscott@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-parsoid: apply [06:44:52] !log cscott@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-parsoid: apply [06:44:53] !log cscott@deploy1003 helmfile [codfw] START helmfile.d/services/mw-parsoid: apply [06:44:56] !log cscott@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-parsoid: apply [07:58:29] (03CR) 10Arnaudb: [C:03+1] "looks good to me. good idea to exclude them, thanks for the change!" [puppet] - 10https://gerrit.wikimedia.org/r/1305937 (https://phabricator.wikimedia.org/T430018) (owner: 10Andrew Bogott) [08:36:39] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [09:24:13] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thank you! LGTM" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1300846 (https://phabricator.wikimedia.org/T418655) (owner: 10Pppery) [10:08:40] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:11:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [10:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [12:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:44:41] FIRING: CoreRouterInterfaceDown: Core router interface down - cr4-ulsfo:et-0/0/0 (Transport: Arelion (IC-398715) {#temp_IC-398715}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr4-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:03:25] RESOLVED: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:11:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [16:09:41] FIRING: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:14:41] RESOLVED: [2x] JobUnavailable: Reduced availability for job sidekiq in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [17:29:11] (03PS1) 10VadymTS1: User groups changes for English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) [17:30:05] (03CR) 10CI reject: [V:04-1] User groups changes for English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [17:32:05] (03PS2) 10VadymTS1: User groups changes for English Wikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) [17:33:23] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, June 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1306056 (https://phabricator.wikimedia.org/T430416) (owner: 10VadymTS1) [18:11:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [18:57:36] (03PS1) 101Veertje: Fix credit encoding, remove photo by prefix, correct Instrument of surrender credit [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306061 [18:59:00] (03PS1) 101Veertje: Fix 'see below' credit for USS New Jersey to Navy Dept. (War Dept.) [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306062 [19:04:10] (03Abandoned) 101Veertje: Fix 'see below' credit for USS New Jersey to Navy Dept. (War Dept.) [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306062 (owner: 101Veertje) [19:04:23] (03Abandoned) 101Veertje: Fix credit encoding, remove photo by prefix, correct Instrument of surrender credit [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306061 (owner: 101Veertje) [19:06:29] (03PS1) 101Veertje: Add daily background rotation with attribution and credit fixes [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306063 [19:07:43] (03PS1) 101Veertje: Add daily background rotation with attribution and credit fixes [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306064 [19:12:25] (03PS1) 101Veertje: Add daily background rotation with attribution and credit fixes [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306065 [19:15:39] (03Abandoned) 101Veertje: Add daily background rotation with attribution and credit fixes [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306063 (owner: 101Veertje) [19:15:46] (03Abandoned) 101Veertje: Add daily background rotation with attribution and credit fixes [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1306064 (owner: 101Veertje) [19:16:14] (03Abandoned) 101Veertje: Add daily background rotation with attribution and footer polish [software/gerrit] (wmf/stable-3.10) - 10https://gerrit.wikimedia.org/r/1305801 (owner: 101Veertje) [20:36:54] FIRING: [2x] CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [21:37:04] (03CR) 10Krinkle: [C:03+1] mediawiki-common: add TTL to gauge metrics produced by statslib [deployment-charts] - 10https://gerrit.wikimedia.org/r/1305523 (https://phabricator.wikimedia.org/T394956) (owner: 10Cwhite) [22:11:04] FIRING: HelmReleaseBadStatus: Helm release wdqs/main-internal on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=wdqs - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:22:17] FIRING: [2x] NodeBGPSessionStatusNotEstablished: Kubernetes node dse-k8s-worker1023:0 has a BGP session which is not in the 'established' state. - https://wikitech.wikimedia.org/wiki/Kubernetes/Administration#NodeBGPSessionStatusNotEstablished - https://alerts.wikimedia.org/?q=alertname%3DNodeBGPSessionStatusNotEstablished [22:48:25] FIRING: SystemdUnitFailed: send_tile_invalidations.service on maps1011:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:59:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.1% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:04:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-web releases routed via main at eqiad: 24.1% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-web&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [23:42:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306070 [23:42:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306070 (owner: 10TrainBranchBot) [23:50:20] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1306070 (owner: 10TrainBranchBot) [23:52:22] FIRING: CertAlmostExpired: gNMI TLS certificate for ssw1-d8-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:57:22] FIRING: [2x] CertAlmostExpired: gNMI TLS certificate for ssw1-d1-eqiad.mgmt.eqiad.wmnet is going to expire in 0s - https://wikitech.wikimedia.org/wiki/Network_monitoring#CertAlmostExpired - https://grafana.wikimedia.org/d/eab73c60-a402-4f9b-a4a7-ea489b374458/gnmic?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired