
Accept multiple ingest pipelines in Filebeat #8914

Merged

Conversation

ycombinator
Contributor

@ycombinator ycombinator commented Nov 2, 2018

Motivated by #8852 (comment).

Starting with 6.5.0, Elasticsearch Ingest Pipelines have gained the ability to:

- run sub-pipelines via the [`pipeline` processor](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/pipeline-processor.html), and
- conditionally run processors via an [`if` field](https://www.elastic.co/guide/en/elasticsearch/reference/6.5/ingest-processors.html).

These abilities combined present the opportunity for a fileset to ingest the same logical information presented in different formats, e.g. plaintext vs. json versions of the same log files. Imagine an entry point ingest pipeline that detects the format of a log entry and then conditionally delegates further processing of that log entry, depending on the format, to another pipeline.

This PR allows filesets to specify one or more ingest pipelines via the `ingest_pipeline` property in their `manifest.yml`. If more than one ingest pipeline is specified, the first one is taken to be the entry point ingest pipeline.

Example with multiple pipelines

```yaml
ingest_pipeline:
  - pipeline-ze-boss.json
  - pipeline-plain.json
  - pipeline-json.json
```

Example with a single pipeline

This is just to show that the existing functionality will continue to work as-is.

```yaml
ingest_pipeline: pipeline.json
```
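The two accepted shapes of `ingest_pipeline` above (a single path or a list of paths) could be normalized with logic like this. This is a Python sketch for illustration only; Filebeat's real implementation is Go code, and the function name here is hypothetical:

```python
def normalize_ingest_pipeline(value):
    """Accept a single pipeline path or a list of paths; the first
    entry of a list is treated as the entry point pipeline.
    Illustrative sketch only -- not Filebeat's actual code."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, list) and value:
        return list(value)
    raise ValueError("ingest_pipeline must be a string or a non-empty list")

# Multiple pipelines: the first one is the entry point.
pipelines = normalize_ingest_pipeline(
    ["pipeline-ze-boss.json", "pipeline-plain.json", "pipeline-json.json"]
)
entry_point = pipelines[0]
```

The single-pipeline form simply becomes a one-element list, which is why existing configs keep working unchanged.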

Now, if the root pipeline wants to delegate processing to another pipeline, it must use a `pipeline` processor to do so. This processor's `name` field will need to reference the other pipeline by its name. To ensure correct referencing, the `name` field must be specified as follows:

```json
{
  "pipeline" : {
    "name": "{< IngestPipeline "pipeline-plain" >}"
  }
}
```

This will ensure that the specified name gets correctly converted to the corresponding name in Elasticsearch, since Filebeat prefixes its "raw" ingest pipeline names with `filebeat-<version>-<module>-<fileset>-` when loading them into Elasticsearch.
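For illustration, the prefixing scheme could be modeled like this (a Python sketch, not Filebeat's actual code; the module and fileset names are hypothetical):

```python
def es_pipeline_id(beat_version, module, fileset, pipeline_name):
    # Filebeat loads each "raw" pipeline into Elasticsearch under a
    # prefixed ID: filebeat-<version>-<module>-<fileset>-<pipeline name>
    return f"filebeat-{beat_version}-{module}-{fileset}-{pipeline_name}"

# A {< IngestPipeline "pipeline-plain" >} reference resolves to the same
# prefixed ID, so the pipeline processor finds the right pipeline:
resolved = es_pipeline_id("6.6.0", "mymodule", "myfileset", "pipeline-plain")
```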

@ycombinator ycombinator added the `enhancement`, `in progress`, `Filebeat`, `needs_backport`, `v7.0.0-alpha1`, and `v6.6.0` labels Nov 2, 2018
@ycombinator ycombinator changed the title Accept multiple ingest pipelines in Filebeat WIP: Accept multiple ingest pipelines in Filebeat Nov 2, 2018
@ycombinator ycombinator force-pushed the filebeat-multiple-ingest-pipelines branch 2 times, most recently from d4feea1 to 2e2add6 Compare November 8, 2018 00:55
@ycombinator
Contributor Author

There's still one TODO item left for this PR related to documentation, but I think it's ready for a code review. Thanks folks!

@@ -64,3 +64,4 @@ The list below covers the major changes between 6.3.0 and master only.
- Allow to disable config resolver using the `Settings.DisableConfigResolver` field when initializing libbeat. {pull}8769[8769]
- Add `mage.AddPlatforms` to allow to specify dependent platforms when building a beat. {pull}8889[8889]
- Add `cfgwarn.CheckRemoved6xSetting(s)` to display a warning for options removed in 7.0. {pull}8909[8909]
- Filesets can now define multiple ingest pipelines, with the first one considered as the root pipeline. {pull}8914[8914]
Member

Is “root” synonymous with entry point in this context?

Contributor Author

@ycombinator ycombinator Nov 8, 2018

Yes, and I'm happy to change the terminology to whatever is clearest to most folks.

Contributor

I like entry point better because "root" suggests that there is always a hierarchy. I'm not sure if that's true. Is it possible that devs might want to specify multiple pipelines as a way of breaking their ingest pipeline configs into smaller pipelines that encapsulate specific processing tasks? Also, can other pipelines besides the first one delegate processing? If so, I would avoid using "root".

@ycombinator
Contributor Author

jenkins, test this

@ycombinator
Contributor Author

@dedemorton I've added docs for this feature in df2488a27e171a8f10d8fac6766ae7a352dbf1a8 but I'm not sure about my language and structure/organization. I'd love for you to take a look, if you have some time. Thanks!

@ycombinator ycombinator changed the title WIP: Accept multiple ingest pipelines in Filebeat Accept multiple ingest pipelines in Filebeat Nov 8, 2018
@ycombinator ycombinator removed the `in progress` label Nov 8, 2018
@ycombinator
Contributor Author

jenkins, test this

1 similar comment
@ycombinator
Contributor Author

jenkins, test this

@ruflin
Member

ruflin commented Nov 12, 2018

What happens if a user uses a module with multiple pipelines against an older version of Elasticsearch?

@ycombinator
Contributor Author

ycombinator commented Nov 14, 2018

What happens if a user uses a module with multiple pipelines against an older version of Elasticsearch?

I tested this PR with Elasticsearch 6.4.0, where neither the `pipeline` processor nor the `if` field on processors exists.

There are two related parts to this PR. One worked with ES 6.4.0, but the other did not. Specifically:

- The part where multiple ingest pipelines are specified under the `ingest_pipeline` key in the module's `manifest.yml` file will work with older versions of ES. All pipelines listed will get loaded into ES.

- The part where a pipeline uses the `pipeline` processor or the `if` processor field (both features introduced in ES 6.5) will not work with older versions of ES. Concretely, when Filebeat attempts to load such a pipeline into ES, ES will return a 400 with an error message that Filebeat emits to its log. FWIW, this can happen today (in master) as well if a pipeline uses a processor that has not been installed in ES as a plugin, e.g. the CSV processor plugin.

Note that the pipeline loading (within a fileset) is short-circuiting. For example, imagine that a module developer has specified 3 pipelines in the list under `ingest_pipeline` in the module's `manifest.yml`. Say the 2nd one in the list fails to load because it contains a `pipeline` processor, an `if` somewhere, or for some other reason. The 1st pipeline in the list would've been successfully loaded into ES, but the 3rd one will not get loaded, and obviously the 2nd one won't get loaded either because it caused the ES error.

I can see four options on how to proceed:

  1. Don't load the pipelines (within a fileset) in a short-circuiting fashion. Instead, even if one pipeline fails to load for some reason, emit the error in the log, but continue to try and load the remaining pipelines in the list.

  2. If a pipeline in the list fails to load, try to DELETE previously-loaded pipelines for the fileset (i.e. rollback). This way either all pipelines within the fileset get loaded or none of them do.

  3. Before attempting to load any pipelines for a fileset, ask Elasticsearch to validate all of them. Again, this way either all pipelines within the fileset get loaded or none of them do.

  4. Keep it simple and keep the current short-circuiting approach. Depending on the error, users might need to upgrade ES or perform some other fix, such as loading a plugin, or filing a bug to get the pipeline definition in Filebeat fixed up. Once the fix is made, they can simply re-run Filebeat and any previously-erroring pipelines will get loaded into ES.

My personal preference would be option 1. It is essentially like option 4 but doesn't give up on the remaining pipelines either, so if multiple pipelines have different errors, at least the user can find out about them all at once. Option 2 could lead to an inconsistent state (if any of the DELETEs fail). Option 3 is probably the safest but also unjustifiably expensive.
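Option 1 could look roughly like the following (a Python sketch; `load_pipeline` stands in for the call that PUTs a pipeline into Elasticsearch, and the real loader lives in Filebeat's Go code):

```python
def load_all_pipelines(pipelines, load_pipeline):
    """Try to load every pipeline; collect errors instead of stopping
    at the first failure (option 1 above)."""
    errors = []
    for name in pipelines:
        try:
            load_pipeline(name)
        except Exception as exc:
            errors.append((name, exc))  # log the error and keep going
    return errors

# Simulated run: the 2nd pipeline fails, but the 1st and 3rd still load.
loaded = []
def fake_load(name):
    if name == "pipeline-plain.json":
        raise RuntimeError("400: unknown processor type [pipeline]")
    loaded.append(name)

errors = load_all_pipelines(
    ["pipeline-ze-boss.json", "pipeline-plain.json", "pipeline-json.json"],
    fake_load,
)
```

This way, if multiple pipelines have different errors, the user sees all of them in one run instead of fixing and re-running one failure at a time.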

@kvch
Contributor

kvch commented Nov 14, 2018

I would not go with option 1, because it could pollute users' ES instances with possibly unused pipelines. I prefer the solutions with rollbacks.

@ruflin
Member

ruflin commented Nov 19, 2018

Can you elaborate on why you think 3 is too expensive? I would expect checking for the pipelines does not happen too often so I would not be worried about the extra load.

I just realised this also touches on another problem: what if geo or user_agent are not installed? At the moment I think the pipeline does not get loaded, as we only have one pipeline, but I'm not sure how good the error is. We already have requirements for the pipelines in the manifest; there we could also add requirements for the ES version and use option 3?

@ruflin
Member

ruflin commented Dec 27, 2018

@ycombinator I'm good with merging. But in case #9777 gets in before this one, a rebase would be nice.

@ycombinator ycombinator force-pushed the filebeat-multiple-ingest-pipelines branch from 1770149 to c5cb4d7 Compare December 27, 2018 17:11
@ycombinator ycombinator merged commit 5ba1f11 into elastic:master Dec 27, 2018
@ycombinator ycombinator deleted the filebeat-multiple-ingest-pipelines branch December 27, 2018 19:19
@ycombinator ycombinator removed the `needs_backport` label Dec 27, 2018
ycombinator added a commit to ycombinator/beats that referenced this pull request Dec 28, 2018
(cherry picked from commit 5ba1f11)
ycombinator added a commit that referenced this pull request Dec 28, 2018
#9811)

Cherry-pick of PR #8914 to 6.x branch.
ycombinator added a commit that referenced this pull request Jan 13, 2019
… 6.5 (#10001)

Follow up to #8914.

In #8914, we introduced the ability for Filebeat filesets to have multiple Ingest pipelines, the first one being the entry point. This feature relies on the Elasticsearch Ingest node having a `pipeline` processor and `if` conditions for processors, both of which were introduced in Elasticsearch 6.5.0.

This PR implements a check for whether a fileset has multiple Ingest pipelines AND is talking to an Elasticsearch cluster < 6.5.0. If that's the case, we emit an error.
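The check described could be sketched like this (illustrative Python, not the actual Go implementation; it assumes a plain `major.minor.patch` version string):

```python
def check_multiple_pipelines_supported(es_version, num_pipelines):
    """Raise an error when a fileset defines multiple ingest pipelines
    but the target Elasticsearch is older than 6.5.0, which introduced
    the `pipeline` processor and per-processor `if` conditions."""
    major, minor = (int(p) for p in es_version.split(".")[:2])
    if num_pipelines > 1 and (major, minor) < (6, 5):
        raise RuntimeError(
            "this fileset requires multiple ingest pipelines, which need "
            "Elasticsearch >= 6.5.0 (connected cluster is %s)" % es_version
        )
```

A fileset with a single pipeline passes the check against any version, preserving the existing behavior.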
ycombinator added a commit to ycombinator/beats that referenced this pull request Jan 13, 2019
… 6.5 (elastic#10001)


(cherry picked from commit c55226e)
ycombinator added a commit that referenced this pull request Jan 14, 2019
…nes is being used with ES < 6.5 (#10038)

Cherry-pick of PR #10001 to 6.x branch.