[FEATURE] Improved multi vector support using Nested fields #1065

Closed
vamshin opened this issue Aug 25, 2023 · 19 comments

@vamshin
Member

vamshin commented Aug 25, 2023

Is your feature request related to a problem?
Related to #675

What solution would you like?
Use Lucene's Parent Join support (apache/lucene#12434) so that a query retrieves all the matching parent documents, instead of searching over child documents, which results in fewer hits.

@vamshin
Member Author

vamshin commented Aug 25, 2023

Question:

  1. Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?

  2. Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.

@vamshin vamshin removed the untriaged label Aug 25, 2023
@vamshin vamshin changed the title [FEATURE] Improved k-NN Nested fields [FEATURE] Improved multi vector support using Nested fields Aug 28, 2023
@vamshin vamshin self-assigned this Aug 28, 2023
@heemin32
Collaborator

heemin32 commented Sep 22, 2023

Question:

  1. Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?

It needs a code change in the k-NN plugin to adopt the feature. Lucene introduced two new query types, ToParentBlockJoinByteKnnVectorQuery and ToParentBlockJoinFloatKnnVectorQuery. For a join query of type has_child, we need to return those queries instead of the KnnByteVectorQuery and KnnFloatVectorQuery that we are using now.

ToParentBlockJoin[Byte|Float]KnnVectorQuery requires one additional argument that we need to pass: BitSetProducer parentsFilter. More investigation is needed on how to obtain that value in the k-NN plugin.

  2. Can this be leveraged for all the engines (Lucene, faiss, nmslib)? From my understanding it's agnostic to the engine, but let's confirm.

The feature works only for the Lucene engine, because Faiss and nmslib use our own custom query.

@heemin32
Collaborator

heemin32 commented Oct 3, 2023

Expected behavior

1. Create knn field with lucene engine

PUT /multi-vector
{
    "settings": {
        "index": {
            "knn": true,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "nested_field": {
                "type": "nested",
                "properties": {
                    "my_vector1": {
                        "type": "knn_vector",
                        "dimension": 3,
                        "method": {
                            "name": "hnsw",
                            "space_type": "l2",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 24
                            }
                        }
                    }
                }
            }
        }
    }
}

2. Index data

PUT /_bulk?refresh=true
{ "index": { "_index": "multi-vector", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "multi-vector", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[20,20,20]},{"my_vector1":[30,30,30]}]}

3. Query data

GET /multi-vector/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector1": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      }
    }
  }
}

4. The query should return two documents (the current implementation returns only one document)

{
	"took": 23,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 2,
			"relation": "eq"
		},
		"max_score": 1.0,
		"hits": [
			{
				"_index": "multi-vector",
				"_id": "1",
				"_score": 1.0,
				"_source": {
					"nested_field": [
						{
							"my_vector1": [
								1,
								1,
								1
							]
						},
						{
							"my_vector1": [
								2,
								2,
								2
							]
						},
						{
							"my_vector1": [
								3,
								3,
								3
							]
						}
					]
				}
			},
			{
				"_index": "multi-vector",
				"_id": "2",
				"_score": 0.0040983604,
				"_source": {
					"nested_field": [
						{
							"my_vector1": [
								10,
								10,
								10
							]
						},
						{
							"my_vector1": [
								20,
								20,
								20
							]
						},
						{
							"my_vector1": [
								30,
								30,
								30
							]
						}
					]
				}
			}
		]
	}
}
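
To make the difference concrete, the sketch below replays steps 2 to 4 in plain Python rather than against OpenSearch. It assumes the score is 1 / (1 + squared L2 distance), which is consistent with the 0.0040983604 shown for document 2. The function names are hypothetical and are not part of the k-NN plugin.

```python
from typing import Dict, List, Tuple

# Nested vectors from the bulk request above, keyed by parent _id.
DOCS: Dict[str, List[List[float]]] = {
    "1": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "2": [[10, 10, 10], [20, 20, 20], [30, 30, 30]],
}


def l2_score(a: List[float], b: List[float]) -> float:
    """Assumed scoring: 1 / (1 + squared L2 distance)."""
    d2 = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + d2)


def child_level_search(query: List[float], k: int) -> List[Tuple[str, float]]:
    """Pick the k best nested vectors first, then collapse them to their parents."""
    children = [(l2_score(query, v), pid) for pid, vecs in DOCS.items() for v in vecs]
    children.sort(key=lambda c: c[0], reverse=True)
    best: Dict[str, float] = {}
    for score, pid in children[:k]:
        best[pid] = max(score, best.get(pid, 0.0))
    return sorted(best.items(), key=lambda p: p[1], reverse=True)


def parent_level_search(query: List[float], k: int) -> List[Tuple[str, float]]:
    """Keep the best nested vector per parent, then pick the k best parents."""
    best = {pid: max(l2_score(query, v) for v in vecs) for pid, vecs in DOCS.items()}
    return sorted(best.items(), key=lambda p: p[1], reverse=True)[:k]


before = child_level_search([1, 1, 1], k=2)
after = parent_level_search([1, 1, 1], k=2)
print("child-level top k :", before)
print("parent-level top k:", after)
```

Here `before` contains only document 1, because both of the two closest nested vectors belong to it, while `after` contains documents 1 and 2.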

@dylan-tong-aws

@heemin32 @vamshin, if someone intentionally modeled multiple documents as vectors in a nested field, this change would break their application, correct?

Or, is there a configuration to modify the behavior?

@heemin32
Collaborator

heemin32 commented Feb 2, 2024

It won't. After the change, a query might return k results where it returned fewer than k results before. If a query already returned more than k results before, its results stay the same after this change.

@dylan-tong-aws

Let's say I index two docs with a nested field. The first doc has vectors [A, B] and the second doc has vectors [C, D]. A, B, C, and D represent different "items" in my corpus.

I perform a search with k=2, and the most similar items are A and B. I expect only the first doc to be returned. This was the previous behavior, correct?

Now suppose (A, B) and (C, D) each represent an "item", and A, B, C, and D are chunks of those items. The new enhancement will retrieve both docs for k=2, correct?

@dylan-tong-aws

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

@heemin32
Collaborator

heemin32 commented Feb 2, 2024

That is correct.

Let's say I index two docs with a nested field. The first doc has vectors [A, B] and the second doc has vectors [C, D]. A, B, C, and D represent different "items" in my corpus.

I perform a search with k=2, and the most similar items are A and B. I expect only the first doc to be returned. This was the previous behavior, correct?

Now suppose (A, B) and (C, D) each represent an "item", and A, B, C, and D are chunks of those items. The new enhancement will retrieve both docs for k=2, correct?

This is correct.

@dylan-tong-aws

OK, so let's say I intentionally modeled my data around the first scenario. If I upgrade to 2.12, would that not break my app? Before 2.12 I get one doc; after upgrading I get two docs in the result. My app's functionality has changed.

@heemin32
Collaborator

heemin32 commented Feb 2, 2024

OK, so let's say I intentionally modeled my data around the first scenario. If I upgrade to 2.12, would that not break my app? Before 2.12 I get one doc; after upgrading I get two docs in the result. My app's functionality has changed.

I wouldn't call that breaking an app; it is an incorrect way of using the nested field. If you rely only on the k value, the behavior is non-deterministic. For example, in the scenario above, if there are two segments, with doc1 in segment1 and doc2 in segment2, then with k=2 you get both documents as results even before this change.
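
A minimal sketch of the segment scenario described above, assuming the top k nested vectors are collected independently in each segment and then mapped to their parent documents. The vector values and the `search` helper are made up for illustration; they are not taken from the plugin.

```python
from typing import Dict, List

Segment = Dict[str, List[List[float]]]  # parent id -> nested vectors


def score(query: List[float], vec: List[float]) -> float:
    """Assumed scoring: 1 / (1 + squared L2 distance)."""
    return 1.0 / (1.0 + sum((q - v) ** 2 for q, v in zip(query, vec)))


def search(segments: List[Segment], query: List[float], k: int) -> List[str]:
    """Take the top k nested vectors per segment, return the distinct parents."""
    parents = set()
    for segment in segments:
        children = [(score(query, v), pid) for pid, vecs in segment.items() for v in vecs]
        children.sort(key=lambda c: c[0], reverse=True)
        parents.update(pid for _, pid in children[:k])
    return sorted(parents)


doc1: Segment = {"doc1": [[1, 1, 1], [2, 2, 2]]}        # vectors A, B
doc2: Segment = {"doc2": [[10, 10, 10], [20, 20, 20]]}  # vectors C, D

# Same data, same query, same k; only the segment layout differs.
one_segment = search([{**doc1, **doc2}], [1, 1, 1], k=2)
two_segments = search([doc1, doc2], [1, 1, 1], k=2)
print(one_segment)
print(two_segments)
```

With both docs in one segment only doc1 comes back, while splitting them across two segments returns doc1 and doc2, so the number of parents returned depends on the segment layout rather than on k alone.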

@heemin32
Collaborator

heemin32 commented Feb 3, 2024

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

That question belongs in the neural-search repo. There is already a GitHub issue for it: opensearch-project/neural-search#482

@dylan-tong-aws

I wouldn't call that breaking an app; it is an incorrect way of using the nested field. If you rely only on the k value, the behavior is non-deterministic. For example, in the scenario above, if there are two segments, with doc1 in segment1 and doc2 in segment2, then with k=2 you get both documents as results even before this change.

One could debate what makes a good data model, but there could be valid reasons for choosing this design. Regardless of whether the user made a good data modeling decision, we don't restrict users from designing their data model either way.

I suggest we add an index configuration like "nested_vector_mode" = SINGLE | MULTI. It could default to "SINGLE". Then at least someone has the option to change the config to "MULTI" in case this causes a breaking change.

@heemin32
Collaborator

heemin32 commented Feb 3, 2024

The k parameter does not set the size of the result. To limit the number of final results of your query, pass the size parameter.
In short, we are increasing recall for nested field search. For example, say you request the 10 nearest vectors and we return only 2 results even though 5 vectors are available. If we improve this and now return 5 vectors, would that be regarded as a breaking change?
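
As a sketch of that distinction, the request body below asks for k=2 nearest neighbors but caps the final hit count with the standard `size` search parameter. The helper name is hypothetical; the field names are reused from the example earlier in this thread.

```python
import json
from typing import Any, Dict, List


def nested_knn_search_body(vector: List[float], k: int, size: int) -> Dict[str, Any]:
    """Build a nested k-NN search body where `size` limits the final hits."""
    return {
        "size": size,  # upper bound on hits in the response
        "query": {
            "nested": {
                "path": "nested_field",
                "query": {
                    "knn": {
                        # k controls how many neighbors the k-NN query gathers
                        "nested_field.my_vector1": {"vector": vector, "k": k}
                    }
                },
            }
        },
    }


body = nested_knn_search_body([1, 1, 1], k=2, size=1)
print(json.dumps(body, indent=2))
```

An application that wants at most one document back, whatever k gathers, would send this body to GET /multi-vector/_search.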

@dylan-tong-aws

dylan-tong-aws commented Feb 3, 2024

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

@heemin32
Collaborator

heemin32 commented Feb 3, 2024

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

No, there is no such case unless the user reranks the returned results.

@asfoorial

Hi all,

Two questions here,

  1. does rerank work on nested text fields?
  2. If parent join is used internally for this feature, would it affect indices that already use parent join?

Thanks

@heemin32
Collaborator

heemin32 commented Apr 2, 2024

Hi all,

Two questions here,

  1. does rerank work on nested text fields?
  2. If parent join is used internally for this feature, would it affect indices that already use parent join?

Thanks

  1. Nested text fields are not supported in the rerank processor.
  2. Yes. An existing index with a nested field benefits from this implementation without reindexing.

@asfoorial

  1. Nested text fields are not supported in the rerank processor.
  2. Yes. An existing index with a nested field benefits from this implementation without reindexing.

I have an index that already uses parent join. Would that conflict with this feature? As far as I know, an index can have only one parent join field.

@heemin32
Collaborator

heemin32 commented Apr 3, 2024

It won't conflict with this feature. This feature does not use parent join internally.
