Update the protocol inference test infra with Mongo changes #1758

kpattaswamy · 2023-10-31T23:58:25Z

Summary: Previously, the TShark command in the dataset_generation script was not able to decode Mongo pcap files and insert them into the dataset for evaluation. This PR adds a flag to the TShark command to decode traffic running through port 27017 as Mongo. The readme is also updated with information about the bidirectional connection-level dataset.
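For readers unfamiliar with TShark's "decode as" option, here is a minimal sketch of how such a flag could be passed. It is not the command from this PR: `build_tshark_cmd` and the choice of output fields are hypothetical, and only the `-d tcp.port==27017,mongo` selector reflects the change described above.

```python
from typing import List

MONGO_PORT = 27017  # port named in the summary above


def build_tshark_cmd(pcap_path: str, port: int = MONGO_PORT) -> List[str]:
    """Return a tshark argv that decodes TCP traffic on `port` as Mongo.

    Hypothetical helper for illustration; the real script may build its
    command differently and request different output fields.
    """
    return [
        "tshark",
        "-r", pcap_path,                   # read packets from a capture file
        "-d", f"tcp.port=={port},mongo",   # "decode as": treat this port as Mongo
        "-T", "fields",                    # print only the selected fields
        "-e", "frame.protocols",           # protocol stack decoded for each frame
    ]
```

The list can be handed to `subprocess.run` unchanged, which avoids shell quoting issues around the `==` selector.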

Updates to the confusion matrix
In the previous image, the connections per protocol in the dataset appear to have been duplicated, leading to an inflated number of connections per protocol. This may have been because the dataset_generation script appended data to the .tsv files each time it was run, even though the content and counts of the underlying pcap files had not changed.

Running the dataset_generation script against empty .tsv files with the same pcap files, followed by the eval script, produced a matrix with far fewer connections per protocol, suggesting that the previous dataset contained duplicates.

The connection counts for each protocol in the older dataset appear to be 4x or 8x the counts in the new dataset, which would explain why the inference accuracy remained constant between the old and new matrices.

The TLS connection count dropped in the new matrix by the previous number of Mongo connections (432) because the new TShark command decodes Mongo connections. The Mongo captures in the old dataset may have been added during one of the early runs of the dataset_generation script and not updated since.
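The suspected append behavior can be illustrated with a small sketch. It assumes the duplication came from opening the .tsv files in append mode; `write_tsv` and `count_rows` are hypothetical helpers, not functions from the dataset_generation script.

```python
import csv
from typing import Iterable, Sequence


def write_tsv(path: str, rows: Iterable[Sequence[str]]) -> None:
    """Write rows to a .tsv file, replacing any previous contents."""
    # Mode "w" truncates the file first. Mode "a" would instead add the
    # same rows again on every run, inflating the per-protocol counts.
    with open(path, "w", newline="") as f:
        csv.writer(f, delimiter="\t").writerows(rows)


def count_rows(path: str) -> int:
    """Count the rows currently stored in a .tsv file."""
    with open(path, newline="") as f:
        return sum(1 for _ in csv.reader(f, delimiter="\t"))
```

With truncating writes, rerunning generation on unchanged pcap files leaves the row count unchanged, which matches the behavior observed when starting from empty .tsv files.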

New mongo additions
In the old dataset, the Mongo pcap files were mainly of type OP_QUERY, an opcode that Stirling does not currently process. More Mongo pcap files of type OP_MSG were added to test the existing inference rule. This resulted in 0.9% being mislabeled as unknown because request-side data was missing from the connection and the existing rule does not support response-side inference for OP_MSG packets. Another 0.7% was mislabeled as pgsql, also because request-side data was missing from the connection and because the packet's opcode is one that Stirling does not recognize.
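To make the opcode distinction concrete, the sketch below reads the opcode from the standard 16-byte MongoDB message header (four little-endian int32 fields). It is an illustration only and not Stirling's inference rule; `mongo_opcode` is a hypothetical name.

```python
import struct
from typing import Optional

# Opcode values defined by the MongoDB wire protocol.
OP_QUERY = 2004  # legacy opcode, common in the older captures
OP_MSG = 2013    # opcode targeted by the added captures

# messageLength, requestID, responseTo, opCode (all little-endian int32)
HEADER = struct.Struct("<iiii")


def mongo_opcode(payload: bytes) -> Optional[int]:
    """Return the opcode from a Mongo message header, or None if too short."""
    if len(payload) < HEADER.size:
        return None
    length, _request_id, _response_to, opcode = HEADER.unpack_from(payload)
    if length < HEADER.size:
        # A declared length smaller than the header cannot be a valid message.
        return None
    return opcode
```

A check such as `mongo_opcode(data) == OP_MSG` separates the newer captures from OP_QUERY traffic, but it says nothing about whether the packet is a request or a response, which is where the mislabeling described above occurs.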

Related issues: #640

Type of change: /kind test-infra

Test Plan: Ran the dataset generation and evaluation scripts with the new TShark flag and verified the .tsv files were created appropriately and the confusion matrix was as expected.

src/stirling/protocol_inference/dataset_generation.py

ddelnano

One comment on clarifying the difference in the matrix, but otherwise lgtm.

Update protocol inference script, readme and confusion matrix image t…

f022be5

…o reflect new mongo captures Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>

kpattaswamy marked this pull request as ready for review November 1, 2023 00:07

kpattaswamy requested a review from a team November 1, 2023 00:07

ddelnano reviewed Nov 1, 2023

ddelnano reviewed Nov 1, 2023

ddelnano approved these changes Nov 1, 2023

kpattaswamy requested a review from a team November 1, 2023 20:28

JamesMBartlett approved these changes Nov 2, 2023

JamesMBartlett merged commit 05fb849 into pixie-io:main Nov 2, 2023
23 of 26 checks passed
