Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🔶 Add asyncIO feature for optimization of batch_translate #202

Draft
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

jiaulislam
Copy link
Contributor

@jiaulislam jiaulislam commented Mar 4, 2023

Description

At current version of deep-translator(1.10.1) batch_translate is doing a linear loop of same function of single translate function. So in that context request is creating new session every time for requesting url to translate. Which is causing translator slow. Also It's working synchronously as each time request is blocking so until that request is finished translator is waiting idle. It can be optimized with asyncio feature also adding a dependency of aiohttp package and further for duplicate request we can utilize lru_cache system. Theres still long way to go for every translator to add the asyncio.

Changes Made

Changes are going to be non-breaking. Maybe we could add different function for async versions. But new dependency should be added aside of request we need to add aiohttp for async http calls.

  • Making new async version function as async_translate_batch for high level function (non-breaking change which adds functionality)
  • Added only for Google Translator with updated tests [ Function Docs Will be added in future commits ]
  • Added _async_translate as abstract method to translate a single text asynchronusly in base.py which should be implemented in every translator

Checklist:

  • My code follows the style guidelines of this project
  • I have performed a self-review of my code
  • I have added tests that prove my fix is effective or that my feature works
  • Make the asyncio aiohttp dependency optional
  • Add asyncio for deepl translator with tests
  • Add asyncio for libre translator with tests
  • Add asyncio for linguee translator with tests
  • Add asyncio for microsoft translator with tests
  • Add asyncio for mymemory translator with tests
  • Add asyncio for papago translator with tests
  • Add asyncio for pons translator with tests
  • Add asyncio for qcri translator with tests
  • Add asyncio for yandex translator with tests

Proof Of Concept

image

Usage

from deep_translator.google import GoogleTranslator
import asyncio

def medium_batch_bengali():
    return [
        "আমি ভালো আছি। তুমি কেমন আছো?",
        "আমার স্কুলে বিশাল লাইব্রেরি আছে।",
        "আমরা বাংলাদেশের জন্য সেরা করে চেষ্টা করছি।",
        "আমরা দীর্ঘ সময় ধরে একসময় সবাই প্রত্যেকের সাথে যুক্ত থাকতে চাই।",
    ]

async def main():
    translator = GoogleTranslator(target="en")
    translated_texts = await translator.async_translate_batch(medium_batch_bengali())

if __name__== "__main__":
    asyncio.run(main())

@jiaulislam
Copy link
Contributor Author

jiaulislam commented Mar 4, 2023

@nidhaloff , It would be great if you provide some of observations here.

Copy link
Owner

@nidhaloff nidhaloff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I really appreciate this work. This is a great feature to add.
I will take a deeper look, but for the moment I only noticed the additional dependencies and I would like to make it optional by default. So it would only be usable if the user wants to. In that case users should install the async version using:
pip install deep-translator[async]

This can be done using extra_requires. You can check the docx or pdf extras to see how this can be done

@@ -58,6 +58,7 @@ beautifulsoup4 = "^4.9.1"
requests = "^2.23.0"
docx2txt = {version = "^0.8", optional = true}
pypdf = {version = "^3.3.0", optional = true}
aiohttp = "^3.8.4"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to make this optional, because otherwise this would introduce a new dependency and increase the size of the package. Users will not like this (myself included). Generally I'm against adding dependencies, but if its optional, then its ok, so IMO just make it optional and add an extra_requires: async for it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok adding to checklist for the task.

Copy link
Owner

@nidhaloff nidhaloff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better to not re-create an async_translate function. Instead maybe it would be better to add an async=False parameter to the normal translate function and then basically reuse the translate function with async=True inside your async_translate_batch function.

That way, you will not have to re-define another translate function, which will be 90% similar to the normal translate function and hence repeating yourself.

from pathlib import Path
from typing import List, Optional, Union

import aiohttp
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You will need a try catch here when you make this depedency optional or maybe just add the import inside the function where this will be used. For an example, check out the docx or pypdf dependencies.

@@ -128,6 +132,23 @@ def translate(self, text: str, **kwargs) -> str:
"""
return NotImplemented("You need to implement the translate method!")

@abstractmethod
@lru_cache(maxsize=128)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be better if you make the cache_maxsize value in a separate global config.py file

self, batch: List[str], **kwargs
) -> List[str]:
if not batch:
raise Exception("Enter your text list that you want to translate")
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add this custom exception to exceptions.py just to keep everything consistent? Something like a NotValidInputBatch exception

translation_tasks = [
self._async_translate(text, session) for text in batch
]
return await asyncio.gather(*translation_tasks)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can get a ValueError here if translation_tasks is empty. You may want to add a check for that

async def _async_translate(
self, text: str, session: aiohttp.ClientSession, **kwargs
):
if is_input_valid(text):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since most of the code in this function is the same as the non-async translate, maybe you can find a way to make some parts reusable

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nidhaloff I'll try to keep all the things in mind and fix it. Thanks for you detailed explanation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants