Update the protocol inference test infra with Mongo changes #1758

Merged



@kpattaswamy kpattaswamy commented Oct 31, 2023

Summary: Previously, the TShark command in the dataset_generation script was not able to decode Mongo pcap files and insert them into the dataset for evaluation. This PR adds a flag to the TShark command to decode traffic on port 27017 as Mongo. The readme is also updated with information about the bidirectional connection-level dataset.
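As a rough illustration of the kind of flag described above, TShark's `-d` ("decode as") option can map a TCP port to a dissector. The sketch below only builds such a command line; the helper name `build_tshark_args`, the pcap path, and the choice of other arguments are assumptions for illustration and are not taken from the dataset_generation script.

```python
def build_tshark_args(pcap_path, mongo_port=27017):
    """Build a tshark argv that decodes traffic on mongo_port as Mongo.

    Hypothetical helper: the real dataset_generation script may pass
    other flags that are not shown here.
    """
    return [
        "tshark",
        "-r", pcap_path,  # read packets from a capture file
        "-d", f"tcp.port=={mongo_port},mongo",  # decode-as rule for this port
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch works
    # even on a machine without tshark installed.
    print(" ".join(build_tshark_args("capture.pcap")))
```

The list form can be handed to `subprocess.run` directly, which avoids shell quoting issues around the `==` and `,` characters.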

Updates to the confusion matrix
In the previous image, the connections per protocol in the dataset seem to have been duplicated, leading to a large number of connections per protocol. This may have been caused by the dataset_generation script appending data to the .tsv files each time it was run, even though the contents and counts of the underlying pcap files had not changed.

Running the dataset_generation script with empty .tsv files and the same pcap files, followed by the eval script, produced a matrix with far fewer connections per protocol, suggesting that the previous dataset did contain duplicates.
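A minimal sketch of how repeated runs in append mode could inflate the row count, compared with starting from an empty file on each run. The file names, the row contents, and the number of runs are made up for illustration; the actual script's write logic is not shown in this PR.

```python
import os
import tempfile


def write_rows(path, rows, append):
    """Write tab-separated rows, either appending or truncating first."""
    mode = "a" if append else "w"
    with open(path, mode, encoding="utf-8") as f:
        for row in rows:
            f.write("\t".join(row) + "\n")


def count_rows(path):
    """Count the lines currently stored in the file."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def demo(runs=4):
    """Return (row count with append, row count with truncation)."""
    rows = [("conn-1", "mongo"), ("conn-2", "tls")]  # made-up example rows
    with tempfile.TemporaryDirectory() as tmp:
        appended = os.path.join(tmp, "appended.tsv")
        fresh = os.path.join(tmp, "fresh.tsv")
        for _ in range(runs):  # same captures processed on every run
            write_rows(appended, rows, append=True)
            write_rows(fresh, rows, append=False)
        return count_rows(appended), count_rows(fresh)


if __name__ == "__main__":
    print(demo())
```

With two rows and four runs, the appended file ends up with eight rows while the truncated file stays at two.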

The connection counts for each protocol in the older dataset appear to be 4x or 8x the counts in the new dataset. This would explain why the inference accuracy remained constant between the old and new matrices.
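The reasoning above can be checked with a small worked example: if every connection of a given protocol is repeated the same number of times, the per-protocol accuracy is unchanged, because the numerator and denominator scale together. The labels and duplication factors below are invented for illustration and are not values from the dataset.

```python
from collections import defaultdict


def per_protocol_accuracy(pairs):
    """Map each true protocol to the fraction of its rows labeled correctly."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for truth, predicted in pairs:
        totals[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {proto: correct[proto] / totals[proto] for proto in totals}


def duplicate(pairs, factors):
    """Repeat each row factors[true_protocol] times."""
    out = []
    for truth, predicted in pairs:
        out.extend([(truth, predicted)] * factors[truth])
    return out


# Made-up (true label, inferred label) pairs.
base = [("mongo", "mongo"), ("mongo", "unknown"), ("tls", "tls")]
inflated = duplicate(base, {"mongo": 4, "tls": 8})

if __name__ == "__main__":
    print(per_protocol_accuracy(base))
    print(per_protocol_accuracy(inflated))
```

Both calls report the same accuracy per protocol even though `inflated` holds many more rows, which matches the observation that only the counts changed between the two matrices.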

The TLS connection count dropped in the new matrix by the previous number of Mongo connections (432) because the new TShark command now decodes Mongo connections. The Mongo entries in the old dataset may have been captured in one of the early runs of the dataset_generation script and not updated since.

New mongo additions
In the old dataset, the Mongo pcap files were mainly of type OP_QUERY, an opcode that Stirling does not currently process. More Mongo pcap files of type OP_MSG were added to test the existing inference rule. This resulted in 0.9% being mislabeled as unknown, because request-side data was missing from the connection and the existing rule does not support response-side inference for OP_MSG packets. Another 0.7% was mislabeled as pgsql, also because request-side data was missing from the connection, and because the opcode of the packet is one that Stirling does not recognize.
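For readers unfamiliar with the opcodes mentioned above, the following sketch reads the standard 16-byte MongoDB wire protocol header (four little-endian int32 fields: messageLength, requestID, responseTo, opCode) and names the opcode. It is not Stirling's inference rule; the function name `classify` is hypothetical, and treating a non-zero responseTo as a sign of a response is a simplifying assumption.

```python
import struct

OP_QUERY = 2004  # legacy query opcode
OP_MSG = 2013    # opcode used by modern drivers

OPCODE_NAMES = {OP_QUERY: "OP_QUERY", OP_MSG: "OP_MSG"}

# messageLength, requestID, responseTo, opCode (all little-endian int32)
HEADER = struct.Struct("<iiii")


def classify(payload):
    """Return (opcode_name, looks_like_response), or None if too short.

    Assumption: a non-zero responseTo is treated as a response.
    """
    if len(payload) < HEADER.size:
        return None
    _length, _request_id, response_to, opcode = HEADER.unpack_from(payload)
    return OPCODE_NAMES.get(opcode, "other"), response_to != 0


if __name__ == "__main__":
    request = struct.pack("<iiii", 16, 7, 0, OP_MSG)
    response = struct.pack("<iiii", 16, 8, 7, OP_MSG)
    print(classify(request), classify(response))
```

A rule that only matches the request side would see nothing useful on a connection where only the second kind of packet was captured, which is the situation the mislabeled cases describe.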

Related issues: #640

Type of change: /kind test-infra

Test Plan: Ran the dataset generation and evaluation scripts with the new TShark flag and verified the .tsv files were created appropriately and the confusion matrix was as expected.

…o reflect new mongo captures

Signed-off-by: Kartik Pattaswamy <kpattaswamy@pixielabs.ai>
@kpattaswamy kpattaswamy marked this pull request as ready for review November 1, 2023 00:07
@kpattaswamy kpattaswamy requested a review from a team November 1, 2023 00:07

@ddelnano ddelnano left a comment


One comment on clarifying the difference in the matrix, but otherwise lgtm.

@kpattaswamy kpattaswamy requested a review from a team November 1, 2023 20:28
@JamesMBartlett JamesMBartlett merged commit 05fb849 into pixie-io:main Nov 2, 2023
23 of 26 checks passed