Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Can't initializate Envoy v1.16+ with CDS message #14119

Closed
maksim-paskal opened this issue Nov 20, 2020 · 10 comments · Fixed by #14382
Closed

Can't initializate Envoy v1.16+ with CDS message #14119

maksim-paskal opened this issue Nov 20, 2020 · 10 comments · Fixed by #14382
Assignees
Milestone

Comments

@maksim-paskal
Copy link

maksim-paskal commented Nov 20, 2020

We have our implementation of go-control-plane, it's work great on envoy v1.15.2 - but upgrading to envoy v1.16.0 - got a Caught Segmentation fault on CDS message with aggregated cluster

# envoyproxy/envoy-alpine-debug:v1.15.2 - works
# envoyproxy/envoy-alpine-debug:v1.16.0 - not-works
# envoyproxy/envoy-alpine-debug:v1.16.1 - not-works
docker run \
  -it \
  --rm \
  -v $(pwd)/config/:/etc/envoy \
  -p 8000:8000 \
  -p 8001:8001 \
  envoyproxy/envoy-alpine-debug:v1.16.0 \
  --config-path /etc/envoy/envoy.yaml \
  --log-level warn \
  --log-format "%v" \
  --bootstrap-version 3 \
  --service-cluster test \
  --service-node test1-id \
  --service-zone test-zone

envoy.yaml

dynamic_resources:
  cds_config:
    path: /etc/envoy/cds.json
admin:
  access_log_path: "/dev/null"
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 8001
static_resources:
  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 8000
    traffic_direction: INBOUND
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          codec_type: AUTO
          rds:
            route_config_name: test
            config_source:
              path: "/etc/envoy/rds.json"
          http_filters:
          - name: envoy.filters.http.router

cds.json

{
  "versionInfo": "d8e61033-fec9-4565-8ce3-a40d7a565b90",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "local_service2",
      "type": "STRICT_DNS",
      "connectTimeout": "0.250s",
      "loadAssignment": {
        "clusterName": "local_service2",
        "endpoints": [
          {
            "locality": {
              "zone": "us-east-1a"
            },
            "lbEndpoints": [
              {
                "endpoint": {
                  "address": {
                    "socketAddress": {
                      "address": "nginxdemo-a",
                      "portValue": 80
                    }
                  }
                }
              }
            ]
          },
          {
            "locality": {
              "zone": "us-east-1b"
            },
            "lbEndpoints": [
              {
                "endpoint": {
                  "address": {
                    "socketAddress": {
                      "address": "nginxdemo-b",
                      "portValue": 80
                    }
                  }
                }
              }
            ]
          }
        ]
      },
      "healthChecks": [
        {
          "timeout": "1s",
          "interval": "5s",
          "unhealthyThreshold": 3,
          "healthyThreshold": 1,
          "httpHealthCheck": {
            "path": "/ready"
          }
        }
      ],
      "circuitBreakers": {
        "thresholds": [
          {
            "maxConnections": 10,
            "maxPendingRequests": 10,
            "maxRequests": 10
          }
        ]
      }
    },
    {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "admin_cluster",
      "type": "STATIC",
      "connectTimeout": "0.250s",
      "loadAssignment": {
        "clusterName": "admin_cluster",
        "endpoints": [
          {
            "lbEndpoints": [
              {
                "endpoint": {
                  "address": {
                    "socketAddress": {
                      "address": "127.0.0.1",
                      "portValue": 18000
                    }
                  }
                }
              }
            ]
          }
        ]
      }
    },
    {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "tls-cluster-example",
      "type": "STATIC",
      "connectTimeout": "1s",
      "loadAssignment": {
        "clusterName": "tls-cluster-example",
        "endpoints": [
          {
            "lbEndpoints": [
              {
                "endpoint": {
                  "address": {
                    "socketAddress": {
                      "address": "127.0.0.1",
                      "portValue": 443
                    }
                  }
                }
              }
            ]
          }
        ]
      },
      "healthChecks": [
        {
          "timeout": "1s",
          "interval": "5s",
          "unhealthyThreshold": 3,
          "healthyThreshold": 1,
          "httpHealthCheck": {
            "host": "some.com",
            "path": "/status.php"
          }
        }
      ],
      "transportSocket": {
        "name": "envoy.transport_sockets.tls",
        "typedConfig": {
          "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext"
        }
      }
    },
    {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "aggregate-cluster",
      "clusterType": {
        "name": "envoy.clusters.aggregate",
        "typedConfig": {
          "@type": "type.googleapis.com/envoy.extensions.clusters.aggregate.v3.ClusterConfig",
          "clusters": [
            "tls-cluster-example",
            "local_service2",
            "admin_cluster"
          ]
        }
      },
      "connectTimeout": "0.250s",
      "lbPolicy": "CLUSTER_PROVIDED"
    }
  ],
  "typeUrl": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
  "nonce": "1"
}

rds.json

{
  "versionInfo": "d8e61033-fec9-4565-8ce3-a40d7a565b90",
  "resources": [
    {
      "@type": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
      "name": "test",
      "virtualHosts": [
        {
          "name": "test",
          "domains": [
            "*"
          ],
          "routes": [
            {
              "match": {
                "prefix": "/1"
              },
              "route": {
                "cluster": "local_service1"
              }
            },
            {
              "match": {
                "prefix": "/2"
              },
              "route": {
                "cluster": "local_service2"
              }
            },
            {
              "match": {
                "prefix": "/tls-cluster-example"
              },
              "route": {
                "cluster": "tls-cluster-example"
              }
            },
            {
              "match": {
                "prefix": "/aggregate-cluster"
              },
              "route": {
                "cluster": "aggregate-cluster"
              }
            },
            {
              "match": {
                "prefix": "/"
              },
              "route": {
                "weightedClusters": {
                  "clusters": [
                    {
                      "name": "local_service1",
                      "weight": 50
                    },
                    {
                      "name": "local_service2",
                      "weight": 50
                    }
                  ]
                }
              }
            }
          ]
        }
      ]
    }
  ],
  "typeUrl": "type.googleapis.com/envoy.config.route.v3.RouteConfiguration",
  "nonce": "1"
}

envoy output

Envoy version: 8fb3cb86082b17144a80402f5367ae65f06083bd/1.16.0/Clean/RELEASE/BoringSSL
#0: [0x7ff2f02ae3d0]
#1: Envoy::ThreadLocal::InstanceImpl::runOnAllThreads() [0x557b008d1cd3]
#2: Envoy::ThreadLocal::InstanceImpl::SlotImpl::runOnAllThreads() [0x557b008d1aff]
#3: Envoy::Extensions::Clusters::Aggregate::Cluster::refresh() [0x557affce6339]
#4: Envoy::Extensions::Clusters::Aggregate::Cluster::startPreInit() [0x557affce60cd]
#5: Envoy::Upstream::ClusterImplBase::initialize() [0x557b00a85585]
#6: Envoy::Upstream::ClusterManagerInitHelper::initializeSecondaryClusters() [0x557b00937feb]
#7: Envoy::Upstream::ClusterManagerInitHelper::maybeFinishInitialize() [0x557b009377dc]
#8: Envoy::Upstream::ClusterManagerInitHelper::removeCluster() [0x557b00936afe]
#9: Envoy::Upstream::ClusterImplBase::finishInitialization() [0x557b00a85ad7]
#10: Envoy::Upstream::ClusterImplBase::onInitDone() [0x557b00a85933]
#11: Envoy::Init::WatcherHandleImpl::ready() [0x557b00cbf74b]
#12: Envoy::Init::ManagerImpl::initialize() [0x557b00cbc44b]
#13: Envoy::Upstream::ClusterImplBase::onPreInitComplete() [0x557b00a8580d]
#14: std::__1::__function::__func<>::operator()() [0x557b00aaa4e9]
#15: Envoy::Network::DnsResolverImpl::PendingResolution::onAresGetAddrInfoCallback() [0x557b009000d6]
#16: end_hquery [0x557b006e71d5]
#17: next_lookup [0x557b006e70b6]
#18: qcallback [0x557b006ea773]
#19: end_query [0x557b006e9b15]
#20: process_answer [0x557b006ea34b]
#21: processfds [0x557b006e8d50]
#22: std::__1::__function::__func<>::operator()() [0x557b009028a5]
#23: Envoy::Event::FileEventImpl::assignEvents()::$_1::__invoke() [0x557b008fc366]
#24: event_process_active_single_queue [0x557b00d34fe8]
#25: event_base_loop [0x557b00d339be]
#26: Envoy::Server::InstanceImpl::run() [0x557b008df79c]
#27: Envoy::MainCommonBase::run() [0x557affc3d008]
#28: Envoy::MainCommon::main() [0x557affc3d807]
#29: main [0x557affc3bbdc]
#30: __libc_start_main [0x7ff2f00fbc8d]
@maksim-paskal maksim-paskal added bug triage Issue requires triage labels Nov 20, 2020
@maksim-paskal maksim-paskal changed the title Can't initializate Envoy v1.16+ with cds message Can't initializate Envoy v1.16+ with CDS message Nov 21, 2020
@mattklein123 mattklein123 added area/aggregate_cluster investigate Potential bug that needs verification bug help wanted Needs help! and removed triage Issue requires triage bug investigate Potential bug that needs verification labels Nov 22, 2020
@mattklein123 mattklein123 self-assigned this Nov 22, 2020
@mattklein123
Copy link
Member

It's hard to tell what is going on here from the stack trace, though this may be a regression from #12833. I will take a look.

@mattklein123 mattklein123 added this to the 1.17.0 milestone Nov 22, 2020
@phlax
Copy link
Member

phlax commented Nov 23, 2020

is this a permissions issue with the cds.json file ?

@maksim-paskal
Copy link
Author

@phlax I have no permission issues - my case:

docker 19.03.13 - darwin/amd64

all files in /tmp/envoy-issue folder

cd /tmp/envoy-issue 

find /tmp/envoy-issue
/tmp/envoy-issue
/tmp/envoy-issue/config
/tmp/envoy-issue/config/envoy.yaml
/tmp/envoy-issue/config/cds.json
/tmp/envoy-issue/config/rds.json

docker run \
  -it \
  --rm \
  -v $(pwd)/config/:/etc/envoy \
  -p 8000:8000 \
  -p 8001:8001 \
  envoyproxy/envoy-alpine-debug:v1.16.0 \
  --config-path /etc/envoy/envoy.yaml \
  --log-level warn \
  --log-format "%v" \
  --bootstrap-version 3 \
  --service-cluster test \
  --service-node test1-id \
  --service-zone test-zone

envoy logs a Caught Segmentation fault

I thinks, its some dns issue on initialising secondary cluster

@phlax
Copy link
Member

phlax commented Nov 23, 2020

ah, k - apologies for noise

@EItanya
Copy link
Contributor

EItanya commented Dec 8, 2020

It's hard to tell what is going on here from the stack trace, though this may be a regression from #12833. I will take a look.

@mattklein123 could this be a result of this capture. Would this be easily transitioned to a shared_this?

@mattklein123
Copy link
Member

@mattklein123 could this be a result of this capture. Would this be easily transitioned to a shared_this?

I'm not sure. I don't see why that would be broken when that is what it was before, but I haven't debugged yet.

@EItanya
Copy link
Contributor

EItanya commented Dec 8, 2020

@mattklein123 could this be a result of this capture. Would this be easily transitioned to a shared_this?

I'm not sure. I don't see why that would be broken when that is what it was before, but I haven't debugged yet.

Fair point. For what it's worth we've been seeing a crash in Istio from the same pointer

@mattklein123
Copy link
Member

I will look at this tomorrow. I have a related change I'm working on and can probably fix this at the same time.

@lizan
Copy link
Member

lizan commented Dec 11, 2020

@EItanya @mattklein123 I've been debugging this a bit, it's not result of the capture. It's a misuse of TLS slot in aggregate cluster implementation, it never call Slot::set so the thread_local_data_ is never allocated for it, so the ASSERT will fail, until some Slot with larger index called set. Aggregate cluster just uses the TLS slot to do runOnAllThreads to update all ThreadLocalCluster in cluster manager.

@mattklein123
Copy link
Member

Ok. I can fix this tomorrow if you don't get to it tonight. Seems like we need better tests for this cluster? Is there an integration test?

mattklein123 added a commit that referenced this issue Dec 11, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup
   existence of the thread local cluster. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this issue Dec 13, 2020
Fixes #14119

Signed-off-by: Matt Klein <mklein@lyft.com>
mattklein123 added a commit that referenced this issue Dec 14, 2020
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup
   existence of the thread local cluster. This change also
   fixes #14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <mklein@lyft.com>
lizan pushed a commit that referenced this issue Jan 15, 2021
Additional Description: Based on #14388
Risk Level: Low
Testing: Build and run the repro from #14119 without crashing, `bazel test test/extensions/clusters/aggregate:cluster_test`
Docs Changes: N/A
Release Notes:
#14119

Signed-off-by: Taylor Barrella <tabarr@google.com>
istio-testing pushed a commit to istio/envoy that referenced this issue Feb 5, 2021
* backport to 1.16: http: fixing a bug with IPv6 hosts (envoyproxy#14238)

Fixing a bug where HTTP parser offsets for IPv6 hosts did not include [] and Envoy assumed it did.
This results in mis-parsing addresses for IPv6 CONNECT requests and IPv6 hosts in fully URLs over HTTP/1.1

Risk Level: low
Testing: new unit, integration tests
Docs Changes: n/a
Release Notes: inline
Signed-off-by: Shikugawa <rei@tetrate.io>
Co-authored-by: alyssawilk <alyssar@chromium.org>

* backport to 1.16: vrp: allow supervisord to open its log file (envoyproxy#14066) (envoyproxy#14279)

Commit Message: Allow supervisord to open its log file
Additional Description:
Change the default location of the log file and give supervisord
permissions to write to it.

Risk Level: low
Testing: built image locally
Docs Changes: n/a
Release Notes: n/a
Platform Specific Features: n/a

Signed-off-by: Alex Konradi <akonradi@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* Closing release 1.16.2. (envoyproxy#14308)

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* Kick-off rel 1.16.3. (envoyproxy#14321)

Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* lua: reset downstream_ssl_connection in StreamInfoWrapper when object is marked dead by Lua GC (envoyproxy#14092) (envoyproxy#14449)

Co-authored-by: Marcin Falkowski <marcin.falkowski@allegro.pl>

* backport to 1.16: tls: fix detection of the upstream connection close event. (envoyproxy#13858) (envoyproxy#14452)

Fixes envoyproxy#13856.

This change also contains the following backports:
- build: Fix some unused variable warnings (envoyproxy#13987)
- test: Check in all TLS test certs (envoyproxy#13702)

Signed-off-by: Piotr Sikora <piotrsikora@google.com>
Signed-off-by: Christoph Pakulski <christoph@tetrate.io>

* backport to 1.16: aggregate cluster: fix TLS init issue (envoyproxy#14456)

Additional Description: Based on envoyproxy#14388
Risk Level: Low
Testing: Build and run the repro from envoyproxy#14119 without crashing, `bazel test test/extensions/clusters/aggregate:cluster_test`
Docs Changes: N/A
Release Notes:
envoyproxy#14119

Signed-off-by: Taylor Barrella <tabarr@google.com>

Co-authored-by: Rei Shimizu <rei@tetrate.io>
Co-authored-by: Christoph Pakulski <christoph@tetrate.io>
Co-authored-by: Marcin Falkowski <marcin.falkowski@allegro.pl>
rexengineering pushed a commit to rexengineering/istio-envoy that referenced this issue Oct 15, 2021
Final follow up from #13906. This PR does:
1) Simplify the logic during startup by making thread local clusters
   only appear after a cluster has been initialized. This is now uniform
   both for bootstrap clusters as well as CDS clusters, making the logic
   simpler to follow.
2) Aggregate cluster needed fixes due to assumptions on startup
   existence of the thread local cluster. This change also
   fixes envoyproxy/envoy#14119
3) Make TLS mocks verify that set() is called before other functions.

Signed-off-by: Matt Klein <mklein@lyft.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants