[FEATURE] Improved multi vector support using Nested fields #1065

vamshin · 2023-08-25T21:50:54Z

Is your feature request related to a problem?
Related to #675

What solution would you like?
Use Lucene's parent-join support for k-NN (apache/lucene#12434) so that a query retrieves all the parent documents it asks for, instead of searching over child documents, which results in fewer hits.

vamshin · 2023-08-25T21:52:10Z

Question:

Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?
Can this be leveraged for all the engines (Lucene, Faiss, nmslib)? From my understanding it's engine-agnostic, but let's confirm.

heemin32 · 2023-09-22T18:13:56Z

Question:

Does this feature need any code changes in the k-NN plugin to get the "Parent Join" feature, or is it automatically available when we move to the latest Lucene version?

It needs a code change in the k-NN plugin to adopt the feature. Lucene introduced two new query types, ToParentBlockJoinByteKnnVectorQuery and ToParentBlockJoinFloatKnnVectorQuery. For a join query of type has_child, we need to return those queries instead of the KnnByteVectorQuery and KnnFloatVectorQuery that we use now.

ToParentBlockJoin[Byte|Float]KnnVectorQuery requires one additional argument that we need to pass: BitSetProducer parentsFilter. How to obtain that value in the k-NN plugin needs more investigation.
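
The role of parentsFilter can be illustrated with a small model. This is a sketch in plain Python, not Lucene code: `parent_of` and the list-based bitset are hypothetical stand-ins. It assumes Lucene's block-join layout, where the nested child documents of a parent are stored immediately before that parent, so the parent bitset is what lets a child hit be mapped to its owning document.

```python
from typing import List


def parent_of(child_doc: int, parent_bits: List[bool]) -> int:
    """Return the doc ID of the parent that owns child_doc.

    Assumes children are indexed immediately before their parent, so the
    parent is the first set bit at or after the child's doc ID.
    """
    for doc_id in range(child_doc, len(parent_bits)):
        if parent_bits[doc_id]:
            return doc_id
    raise ValueError("child has no parent in this segment")


# Two blocks of three children followed by one parent each:
# doc IDs 0-2 are children of parent 3, doc IDs 4-6 are children of parent 7.
parent_bits = [False, False, False, True, False, False, False, True]

print(parent_of(1, parent_bits))
print(parent_of(4, parent_bits))
```

With such a mapping, a vector search over child documents can keep at most one hit per parent, which is what allows k to count parents rather than children.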

Can this be leveraged for all the engines (Lucene, Faiss, nmslib)? From my understanding it's engine-agnostic, but let's confirm.

The feature works only for the Lucene engine, because Faiss and nmslib use our own custom query.

heemin32 · 2023-10-03T17:15:23Z

Expected behavior

1. Create knn field with lucene engine

PUT /multi-vector
{
    "settings": {
        "index": {
            "knn": true,
            "knn.algo_param.ef_search": 100
        }
    },
    "mappings": {
        "properties": {
            "nested_field": {
                "type": "nested",
                "properties": {
                    "my_vector1": {
                        "type": "knn_vector",
                        "dimension": 3,
                        "method": {
                            "name": "hnsw",
                            "space_type": "l2",
                            "engine": "lucene",
                            "parameters": {
                                "ef_construction": 128,
                                "m": 24
                            }
                        }
                    }
                }
            }
        }
    }
}

2. Index data

PUT /_bulk?refresh=true
{ "index": { "_index": "multi-vector", "_id": "1" } }
{"nested_field":[{"my_vector1":[1,1,1]},{"my_vector1":[2,2,2]},{"my_vector1":[3,3,3]}]}
{ "index": { "_index": "multi-vector", "_id": "2" } }
{"nested_field":[{"my_vector1":[10,10,10]},{"my_vector1":[20,20,20]},{"my_vector1":[30,30,30]}]}

3. Query data

GET /multi-vector/_search
{
  "query": {
    "nested": {
      "path": "nested_field",
      "query": {
        "knn": {
          "nested_field.my_vector1": {
            "vector": [1,1,1],
            "k": 2
          }
        }
      }
    }
  }
}

4. Should return two documents (Current implementation returns 1 document)

{
	"took": 23,
	"timed_out": false,
	"_shards": {
		"total": 1,
		"successful": 1,
		"skipped": 0,
		"failed": 0
	},
	"hits": {
		"total": {
			"value": 2,
			"relation": "eq"
		},
		"max_score": 1.0,
		"hits": [
			{
				"_index": "multi-vector",
				"_id": "1",
				"_score": 1.0,
				"_source": {
					"nested_field": [
						{
							"my_vector1": [
								1,
								1,
								1
							]
						},
						{
							"my_vector1": [
								2,
								2,
								2
							]
						},
						{
							"my_vector1": [
								3,
								3,
								3
							]
						}
					]
				}
			},
			{
				"_index": "multi-vector",
				"_id": "2",
				"_score": 0.0040983604,
				"_source": {
					"nested_field": [
						{
							"my_vector1": [
								10,
								10,
								10
							]
						},
						{
							"my_vector1": [
								20,
								20,
								20
							]
						},
						{
							"my_vector1": [
								30,
								30,
								30
							]
						}
					]
				}
			}
		]
	}
}
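
The scores in the expected response can be checked by hand. The sketch below assumes the l2 space scores a vector as 1 / (1 + squared Euclidean distance) and that each parent takes the score of its closest nested vector, which is consistent with the values shown above. `l2_score` is a hypothetical helper written for this illustration, not a plugin function.

```python
from typing import Dict, List


def l2_score(query: List[float], vector: List[float]) -> float:
    """Score for the l2 space: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


# Nested vectors per parent document, copied from the bulk request above.
docs: Dict[str, List[List[float]]] = {
    "1": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "2": [[10, 10, 10], [20, 20, 20], [30, 30, 30]],
}
query = [1, 1, 1]

# Assumption: each parent is scored by its best-matching nested vector.
parent_scores = {
    doc_id: max(l2_score(query, vec) for vec in vectors)
    for doc_id, vectors in docs.items()
}
print(parent_scores)
```

Document 1 contains the query vector itself, so its distance is 0 and its score is 1.0. For document 2 the closest vector is [10,10,10], with squared distance 3 × 81 = 243, giving 1/244, which matches the 0.0040983604 in the response up to float precision.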

dylan-tong-aws · 2024-02-02T23:40:45Z

@heemin32 @vamshin, if someone intentionally modeled multiple documents as vectors in a nested field, this change would break their application, correct?

Or, is there a configuration to modify the behavior?

heemin32 · 2024-02-02T23:45:30Z

It won't. After the change, a query might return k results where it returned fewer than k before. If a query already returned more than k results, the results will be the same after this change.
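
A simplified model of the difference, using the data from the example in this issue. This is only a sketch: it uses exact brute-force search in a single segment instead of HNSW, and `child_level_search` and `parent_level_search` are hypothetical names for the behavior before and after the change.

```python
from typing import Dict, List, Tuple

# (parent id, nested vector) pairs from the example in this issue.
children: List[Tuple[str, List[float]]] = [
    ("1", [1, 1, 1]), ("1", [2, 2, 2]), ("1", [3, 3, 3]),
    ("2", [10, 10, 10]), ("2", [20, 20, 20]), ("2", [30, 30, 30]),
]


def squared_l2(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def child_level_search(query: List[float], k: int) -> List[str]:
    """Before (simplified): take the k nearest children, then join to parents."""
    nearest = sorted(children, key=lambda c: squared_l2(query, c[1]))[:k]
    parents: List[str] = []
    for parent_id, _ in nearest:
        if parent_id not in parents:
            parents.append(parent_id)
    return parents


def parent_level_search(query: List[float], k: int) -> List[str]:
    """After (simplified): keep the best child per parent, then take k parents."""
    best: Dict[str, float] = {}
    for parent_id, vector in children:
        distance = squared_l2(query, vector)
        if parent_id not in best or distance < best[parent_id]:
            best[parent_id] = distance
    return sorted(best, key=lambda parent_id: best[parent_id])[:k]


query = [1, 1, 1]
print(child_level_search(query, k=2))   # both nearest children belong to doc 1
print(parent_level_search(query, k=2))  # one best child per parent, two parents
```

With k=2 the child-level search spends both slots on vectors of document 1 and returns a single parent, while the parent-level search returns both documents. Document 1 stays ranked first in both cases.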

dylan-tong-aws · 2024-02-02T23:52:01Z

Let's say I index two docs with a nested field. The first doc has vectors [A,B] and the second doc has vectors [C,D]. A, B, C, D represent different "items" in my corpus.

I perform a search with k=2 and the most similar items are A and B. I expect the first doc to be returned. This was the previous behavior, correct?

Now, if (A,B) and (C,D) represent "items" and A, B, C, D are chunks of those items, the new enhancement will retrieve both docs for k=2, correct?

dylan-tong-aws · 2024-02-02T23:53:46Z

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

heemin32 · 2024-02-02T23:53:59Z

That is correct.

Let's say I index two docs with a nested field. The first doc has vectors [A,B] and the second doc has vectors [C,D]. A, B, C, D represent different "items" in my corpus.

I perform a search with k=2 and the most similar items are A and B. I expect the first doc to be returned. This was the previous behavior, correct?

Now, if (A,B) and (C,D) represent "items" and A, B, C, D are chunks of those items, the new enhancement will retrieve both docs for k=2, correct?

This is correct.

dylan-tong-aws · 2024-02-02T23:55:27Z

Ok, so let's say I intentionally data modeled my application around the first scenario. If I upgrade to 2.12, would it not break my app? Pre 2.12 I get one doc. I upgrade now I get two docs in the result. My app functionality has changed.

heemin32 · 2024-02-02T23:59:58Z

I wouldn't call that breaking the app. It is an incorrect way of using the nested field. If you rely only on the k value, the behavior is non-deterministic. For example, in the scenario above, if there are two segments with doc1 in segment 1 and doc2 in segment 2, then with k=2 you get both documents back even before this change.
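
The segment effect can be sketched with a toy model. It assumes, as the comment describes, that each segment contributes its own k nearest child vectors and that the merged hits are joined to parents without being trimmed back to k. The 1-D "vectors" and the `search` helper are hypothetical and only stand in for A, B, C, D.

```python
from typing import List, Tuple

Child = Tuple[str, float]  # (parent id, 1-D "vector")


def top_k_children(segment: List[Child], query: float, k: int) -> List[Child]:
    """Exact k nearest children within one segment."""
    return sorted(segment, key=lambda c: abs(c[1] - query))[:k]


def search(segments: List[List[Child]], query: float, k: int) -> List[str]:
    """Child-level k-NN per segment, merged and joined to distinct parents."""
    hits: List[Child] = []
    for segment in segments:
        hits.extend(top_k_children(segment, query, k))
    hits.sort(key=lambda c: abs(c[1] - query))
    parents: List[str] = []
    for parent_id, _ in hits:
        if parent_id not in parents:
            parents.append(parent_id)
    return parents


doc1 = [("doc1", 1.0), ("doc1", 2.0)]    # vectors A, B
doc2 = [("doc2", 10.0), ("doc2", 20.0)]  # vectors C, D

one_segment = [doc1 + doc2]
two_segments = [doc1, doc2]

print(search(one_segment, query=1.0, k=2))   # only doc1
print(search(two_segments, query=1.0, k=2))  # doc1 and doc2
```

The same data and the same k give one or two documents depending only on how the documents happen to be spread across segments, which is why the number of hits is not something an application can rely on.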

heemin32 · 2024-02-03T00:05:40Z

Also, can you please provide an example of how to use this feature with neural search? Specifically, given a nested field of strings, how can I construct a nested field of vectors using the text embedding processor?

That question should be asked in the neural-search repo. There is a GitHub issue for it: opensearch-project/neural-search#482

dylan-tong-aws · 2024-02-03T00:05:47Z

One could debate what's a good data model, but there could be valid reasons for electing this data modeling design. Regardless of whether the user made a good data modeling decision, we don't govern or restrict users from being able to design their data model in either way.

I suggest we have an index configuration like "nested_vector_mode" = SINGLE | MULTI. It could be defaulted to "SINGLE". At least someone has the option to change the config to "MULTI" in case this causes a breaking change.

heemin32 · 2024-02-03T00:14:27Z

The k parameter does not define the size of the result. You need to pass the size parameter to limit the number of final results of your query.
In short, we are increasing recall for nested field search. For example, let's say you requested the 10 nearest vectors and we returned only 2 results even though 5 vectors were available. If we improve this and return 5 vectors now, would that be regarded as a breaking change?
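
For an application that needs a fixed number of hits, the request body can carry size next to k. A minimal sketch that builds such a body for the index from the example above; `nested_knn_request` is a hypothetical helper and sending the request is left to whichever client is in use.

```python
import json
from typing import List


def nested_knn_request(vector: List[float], k: int, size: int) -> dict:
    """Build a nested k-NN search body that caps the final result with size."""
    return {
        "size": size,  # upper bound on the number of hits returned
        "query": {
            "nested": {
                "path": "nested_field",
                "query": {
                    "knn": {
                        "nested_field.my_vector1": {"vector": vector, "k": k}
                    }
                },
            }
        },
    }


body = nested_knn_request([1, 1, 1], k=2, size=1)
print(json.dumps(body, indent=2))
```

Here k still controls how many neighbors the k-NN query collects, while size is the only value that bounds how many documents come back.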

dylan-tong-aws · 2024-02-03T00:35:06Z

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

heemin32 · 2024-02-03T02:14:35Z

Can you think of a scenario where the ranking is changed? Before we only return 2 results because the 5 most similar things are in those two documents. Now we return 5. Can you think of a scenario where the 2 original results might end up being ranked lower after the change?

No such case, unless the user reranks the returned results.

asfoorial · 2024-03-31T16:55:04Z

Hi all,

Two questions here,

1. Does rerank work on nested text fields?
2. If parent join is used internally for this feature, would it affect indices that already use parent join?

Thanks

heemin32 · 2024-04-02T11:55:53Z

1. Nested text fields are not supported by the rerank processor.
2. Yes. An existing index with a nested field will benefit from this implementation without reindexing.

asfoorial · 2024-04-02T11:58:24Z

I have an index that already uses parent join. Would that conflict with this feature? As far as I know, an index can have only one parent join field.

heemin32 · 2024-04-03T01:23:27Z

It won't conflict with this feature. This feature does not use parent join internally.

vamshin added untriaged enhancement labels Aug 25, 2023

vamshin removed the untriaged label Aug 25, 2023

vamshin changed the title ~~[FEATURE] Improved k-NN Nested fields~~ [FEATURE] Improved multi vector support using Nested fields Aug 28, 2023

vamshin added the v2.11.0 label Aug 28, 2023

vamshin self-assigned this Aug 28, 2023

vamshin added k-NN backlog labels Aug 29, 2023

heemin32 mentioned this issue Sep 26, 2023

Pass parent filter to inner query in nested query opensearch-project/OpenSearch#10246

Merged

6 tasks

vamshin added the v2.12.0 label Sep 29, 2023

This was referenced Oct 2, 2023

Add parent join support for lucene knn #1181

Closed

Add parent join support for lucene knn #1182

Merged

heemin32 removed the v2.11.0 label Oct 4, 2023

heemin32 mentioned this issue Oct 30, 2023

[DOC]Multi vector support in nested field opensearch-project/documentation-website#5431

Closed

4 tasks

vamshin assigned heemin32 Nov 15, 2023

samuel-oci mentioned this issue Dec 14, 2023

[FEATURE]Ability to chunk the documents and generate multiple embeddings using k-NN nested fields. opensearch-project/neural-search#482

Closed

heemin32 mentioned this issue Dec 20, 2023

Add patch to support multi vector in faiss #1358

Merged

5 tasks

This was referenced Jan 3, 2024

Multi vector support for lucene heemin32/k-NN#1

Closed

Multi vector support for Faiss HNSW - approximate search only #1371

Merged

Handle multi-vector in exact search scenario #1378

Merged

heemin32 mentioned this issue Jan 19, 2024

Add parent join support for faiss hnsw #1398

Merged

5 tasks

heemin32 mentioned this issue Feb 2, 2024

[FEATURE] Provide context on Inner hit of multi vector to aid highlighting/debug use cases #1447

Closed

dylan-tong-aws mentioned this issue Feb 9, 2024

[RFC] Text chunking design opensearch-project/neural-search#548

Closed

ryanbogan added v2.13.0 and removed v2.12.0 labels Feb 21, 2024

samuel-oci mentioned this issue Feb 23, 2024

[META] Chunking and querying of long passages for vector search opensearch-project/neural-search#612

Open

vamshin closed this as completed Mar 15, 2024

navneet1v mentioned this issue May 14, 2024

[FEATURE] Hybrid request does not return inner_hits for nested objects. opensearch-project/neural-search#718

Open
