Skip to content

Python class-based project to recommend and discover academic papers from arXiv based on NLP.

License

Notifications You must be signed in to change notification settings

mariamwalaa/arXivRecSys

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ArXiv Paper Recommendation System

The ArXiv Paper Recommendation System is a Python class-based project that allows you to discover and recommend academic papers from the arXiv repository based on textual similarity. This system leverages natural language processing techniques, including TF-IDF vectorization and cosine similarity, to suggest papers that are similar to a selected paper of interest.

Features

  • Data Retrieval: Automatically fetches academic paper metadata, including titles and abstracts, from arXiv within specific mathematical categories (e.g., math.CO, math.PR, math.ST).

  • TF-IDF Vectorization: Utilizes TF-IDF (Term Frequency-Inverse Document Frequency) vectorization to convert paper abstracts into numerical vectors, capturing the importance of terms relative to the entire corpus.

  • Cosine Similarity: Calculates cosine similarity scores between papers to determine their textual similarity based on TF-IDF vectors.

  • Graphical User Interface (GUI): Provides a user-friendly Tkinter-based GUI for users to select a paper title and discover the top 5 most similar papers based on the selected paper.

Getting Started

These instructions will help you set up and run the ArXiv Paper Recommendation System using the provided Python class.

Prerequisites

Make sure you have the following libraries installed:

  • requests
  • feedparser
  • pandas
  • scikit-learn (for TF-IDF and cosine similarity)
  • tkinter (for the GUI)

You can install these libraries using pip:

pip install requests feedparser pandas scikit-learn

Usage

  1. Clone this repository or download the arxiv_recommendation.py file to your local machine.
  2. Create an instance of the ArXivPaperRecommendationSystem class and call its methods:
if __name__ == "__main__":

    # Initialize the ArXiv Paper Recommendation System
    recommendation_system = ArXivPaperRecommendationSystem()

    # 1. Retrieve arXiv data
    recommendation_system.retrieve_arxiv_data()

    # 2. Calculate TF-IDF vectors and similarities
    recommendation_system.calculate_similarity()

    # 3. Create the GUI
    recommendation_system.create_gui()

    # 4. Run the system and start the Tkinter main loop
    recommendation_system.run()
  1. The GUI will open, allowing you to select a paper title and find similar papers based on textual similarity.

Use Case

Here are some categories that may be of interest to a data scientist.

  1. stat.ML (Machine Learning):

    • This category covers a broad spectrum of machine learning research. Data scientists can find papers on various ML algorithms, deep learning techniques, and applications. Staying updated in this field is crucial for improving machine learning models and solving real-world problems.
  2. cs.DS (Data Structures and Algorithms):

    • Data scientists deal with large datasets and often need to optimize data structures and algorithms for efficient processing. This category provides insights into advanced data structures and algorithmic techniques.
  3. math.CO (Combinatorics):

    • Combinatorics deals with counting, arranging, and analyzing objects. Data scientists can benefit from this category when working with discrete structures and combinatorial optimization problems, which often arise in data analysis, network analysis, and graph algorithms.
  4. stat.AP (Applications):

    • Data scientists frequently apply statistical methods to solve real-world problems. This category includes research on statistical applications in various fields, providing valuable insights into practical data analysis.
  5. cs.CV (Computer Vision and Pattern Recognition):

    • Computer vision plays a crucial role in image and video analysis. Data scientists working on image recognition, object detection, and related tasks can find innovative techniques and models in this category.
  6. stat.TH (Statistics - Theory):

    • Understanding the theoretical foundations of statistics is essential for data scientists. This category delves into statistical theory, which underpins many data analysis and modeling techniques.
  7. cs.CL (Computation and Language):

    • Natural language processing (NLP) is a vital area for data scientists working on text data. This category covers NLP research, including sentiment analysis, text classification, and language modeling.
  8. econ.EM (Econometrics):

    • Data scientists interested in economics and econometric modeling can find relevant research here. Econometrics combines statistical methods with economic theory to analyze economic data.
  9. math.ST (Statistics):

    • This category focuses exclusively on statistics. Data scientists seeking a deep understanding of statistical concepts, hypothesis testing, and experimental design can benefit from these papers.
  10. q-bio.QM (Quantitative Methods):

    • Quantitative methods are crucial in biology and bioinformatics. Data scientists working in these domains can find papers related to modeling biological systems, genomics, and bioinformatic data analysis.

About

Python class-based project to recommend and discover academic papers from arXiv based on NLP.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published