So... stupid idea to help a friend: can we build a chatbot to help them get information from a website where the information is spread across multiple pages? Let's try to tackle this in 2h.
The original idea: it seems a mix of GPT-3 and embeddings would work for processing a large corpus of text.
https://twitter.com/danshipper/status/1620464918515302401 https://every.to/chain-of-thought/the-end-of-organizing
I guess we'll be reusing this a lot too.
Application to a public website? We can. Let's start with the obvious sources of information on where to go and what not to do.
https://www.mottmac.com/sitemap.xml
https://www.mottmac.com/robots.txt
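A minimal sketch of how these two files could be combined, using only the standard library: list the page URLs from the sitemap, then keep the ones robots.txt allows. The function names `parse_sitemap` and `allowed_urls` are made up for illustration, and the sketch assumes a plain `urlset` sitemap in the standard sitemaps.org namespace; the actual notebook may well do this differently.

```python
import urllib.robotparser
import xml.etree.ElementTree as ET
from typing import List

# Standard sitemap namespace (assumption: the site uses a plain urlset sitemap).
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[str]:
    """Return every <loc> URL found in a sitemap XML string."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def allowed_urls(urls: List[str], robots_txt: str, user_agent: str = "*") -> List[str]:
    """Keep only the URLs that the robots.txt rules allow for this user agent."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

Both functions take the file contents as strings, so the download step (with `requests`, `urllib`, or anything else) stays separate and the filtering can be tried offline.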
For cleaning and extracting text, apart from the usual Beautiful Soup, I had heard about trafilatura.
https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html
Let's get going !
Output: a cache folder. Gitignored here.
URLs captured in here
Trafilatura seems easier than expected.
Text cleaned captured in here
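A rough sketch of the extract-and-cache step, assuming one text file per URL in the cache folder. `trafilatura.fetch_url` and `trafilatura.extract` are the library's basic download and extraction calls; the helper names (`default_extractor`, `cached_text`) and the hashed file naming are assumptions for illustration, not necessarily what the notebook does.

```python
import hashlib
from pathlib import Path
from typing import Callable, Optional


def default_extractor(url: str) -> Optional[str]:
    """Download a page and return its main text using trafilatura."""
    import trafilatura  # third-party; imported lazily so the cache logic works without it

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        return None
    return trafilatura.extract(downloaded)


def cached_text(
    url: str,
    cache_dir: str,
    extractor: Callable[[str], Optional[str]] = default_extractor,
) -> Optional[str]:
    """Return the cleaned text for a URL, hitting the network only on a cache miss."""
    folder = Path(cache_dir)
    folder.mkdir(parents=True, exist_ok=True)
    # Hypothetical naming scheme: one file per URL, named after the URL's hash.
    path = folder / (hashlib.sha256(url.encode("utf-8")).hexdigest() + ".txt")
    if path.exists():
        return path.read_text(encoding="utf-8")
    text = extractor(url)
    if text:
        path.write_text(text, encoding="utf-8")
    return text
```

Keeping the extractor pluggable means Beautiful Soup or anything else can be swapped in later without touching the cache.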
First! Some commit.
[12:28] : adding metadata to the general dataframe
So let's take a break. Going to get the pages, run the embeddings, and we'll see what happens next. Counter on hold at 1h30.
+10mins: Articles selected, cleaned and processed. Launching embedding. Prepped the AskingQuestion tool.
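For the question-asking part, the usual pattern with embeddings is: embed the question, rank the stored chunks by cosine similarity, and paste the best ones into the prompt. Here is a dependency-free sketch of that pattern. The names `top_chunks` and `build_prompt` and the prompt wording are hypothetical, and nothing here claims to be how the AskingQuestion tool is actually written.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_chunks(question_vec: Sequence[float], chunk_vecs: Dict[str, Sequence[float]], k: int = 3) -> List[str]:
    """Return the keys of the k chunks whose embeddings are closest to the question."""
    ranked = sorted(
        chunk_vecs,
        key=lambda key: cosine_similarity(question_vec, chunk_vecs[key]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, chunks: List[str]) -> str:
    """Assemble a completion prompt from the retrieved chunks (wording is illustrative)."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The question vector has to come from the same embedding model as the chunk vectors, otherwise the similarities are meaningless.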
+5mins: The API seems to be unstable. Regularly backing up the embedding results.
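One way to cope with a flaky API is to retry each call a few times and write the results to disk every few items, so a crash only costs the last batch. A sketch under those assumptions: `embed_fn` stands in for whatever call actually produces the embedding, and the JSON backup format, the function name and the parameter defaults are all made up for illustration.

```python
import json
import time
from pathlib import Path
from typing import Callable, Dict, List, Sequence


def embed_with_backup(
    texts: Dict[str, str],
    embed_fn: Callable[[str], Sequence[float]],
    backup_path: str,
    save_every: int = 10,
    max_retries: int = 3,
    wait_seconds: float = 1.0,
) -> Dict[str, List[float]]:
    """Embed every text, resuming from and regularly saving to a JSON backup file."""
    path = Path(backup_path)
    results: Dict[str, List[float]] = {}
    if path.exists():
        results = json.loads(path.read_text(encoding="utf-8"))
    unsaved = 0
    for key, text in texts.items():
        if key in results:
            continue  # already embedded in a previous run
        for attempt in range(1, max_retries + 1):
            try:
                results[key] = list(embed_fn(text))
                unsaved += 1
                break
            except Exception:
                # Wait a bit longer after each failure; after the last one,
                # skip the item so a later run can pick it up again.
                if attempt < max_retries:
                    time.sleep(wait_seconds * attempt)
        if unsaved >= save_every:
            path.write_text(json.dumps(results), encoding="utf-8")
            unsaved = 0
    path.write_text(json.dumps(results), encoding="utf-8")
    return results
```

Re-running the function with the same backup file only calls the API for keys that are still missing.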
Streamlit app?
https://medium.com/@avra42/build-your-own-chatbot-with-openai-gpt-3-and-streamlit-6f1330876846
It lives there.