Website_AMA

So .. stupid idea to help a friend .. can we build a chatbot to help them get information from a website where it is spread across multiple pages? Let's try and take 2h to tackle this.

[11:40] Start and background

Original idea for this: it seems a mix of GPT-3 and embeddings would work to process a large corpus of text.

https://twitter.com/danshipper/status/1620464918515302401 https://every.to/chain-of-thought/the-end-of-organizing

I guess we'll be reusing this a lot too.

https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb

Application to a public website? We can. Let's start with the obvious sources of information: where to go (the sitemap) and what not to do (robots.txt).

https://www.mottmac.com/sitemap.xml

https://www.mottmac.com/robots.txt
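A minimal sketch of how these two files can be combined, using only the standard library. It is not the notebook's actual code: the function names `urls_from_sitemap` and `allowed_urls` are made up for illustration, and both take the file contents as strings so the download step is left out.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by standard sitemap.xml files.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(sitemap_xml):
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


def allowed_urls(urls, robots_txt, user_agent="*"):
    """Keep only the URLs that robots.txt lets this user agent fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]


# Hypothetical sample content, standing in for the downloaded files.
sample_sitemap = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/</loc></url>"
    "<url><loc>https://example.com/private/page</loc></url>"
    "</urlset>"
)
sample_robots = "User-agent: *\nDisallow: /private/"

candidates = allowed_urls(urls_from_sitemap(sample_sitemap), sample_robots)
```

With the sample content, only the home page survives the robots.txt filter.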

For cleaning and extracting text, apart from the usual Beautiful Soup, I had heard about trafilatura.

https://adrien.barbaresi.eu/blog/trafilatura-main-text-content-python.html

Let's get going !

[11:50] URLs cleared and processed

Output : a cache folder. Gitignored here.
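One way such a cache folder could work, as a sketch under assumptions: pages are stored under a hash of their URL, and the download function is passed in as `fetch`. The names `cache_path`, `get_page` and `fetch` are hypothetical, not taken from the notebook.

```python
import hashlib
from pathlib import Path


def cache_path(url, cache_dir):
    """Map a URL to a stable file name inside the cache folder."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return Path(cache_dir) / (digest + ".html")


def get_page(url, cache_dir, fetch):
    """Return the page HTML, calling fetch(url) only on a cache miss."""
    path = cache_path(url, cache_dir)
    if path.exists():
        return path.read_text(encoding="utf-8")
    html = fetch(url)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(html, encoding="utf-8")
    return html
```

Re-running the notebook then only hits the website for pages that are not on disk yet.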

URLs captured in here

[12:00] Getting pages works! Onto getting text

Trafilatura seems easier than expected.
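For reference, the core of it can be as short as the sketch below. `trafilatura.extract` is the library's main extraction call; the wrapper `extract_main_text` and the `normalise_whitespace` helper are assumptions added for illustration, and the import is done lazily so the helper can be used without trafilatura installed.

```python
import re


def extract_main_text(html):
    """Return the main text of an HTML page, or None if nothing is found."""
    # Lazy import: trafilatura is a third-party package (pip install trafilatura).
    import trafilatura

    return trafilatura.extract(html)


def normalise_whitespace(text):
    """Collapse runs of spaces/tabs and of blank lines (an assumed extra cleaning step)."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()
```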

Text cleaned captured in here

[12:05] Trying embedding.

First! Some commit.

[12:28] : adding metadata to the general dataframe

[13:11] Embedding mechanisms saved
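The idea behind the embedding step, roughly as in the OpenAI cookbook notebook linked above: each page gets a vector, the question gets a vector, and the pages closest to the question are kept. A dependency-free sketch, assuming the vectors have already been computed; `cosine_similarity` and `top_matches` are illustrative names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_matches(query_vec, doc_vecs, k=3):
    """Return the k document keys whose vectors are most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), key) for key, vec in doc_vecs.items()]
    scored.sort(reverse=True)
    return [key for _, key in scored[:k]]


# Toy 2-dimensional vectors; real embeddings have many more dimensions.
toy_docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
best = top_matches([1.0, 0.1], toy_docs, k=2)
```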

So let's take a break. Going to get the pages, run the embeddings, and we'll see what happens next. Counter on hold at 1h30.

+10mins: Articles selected, cleaned and processed. Launching embedding. Prepped the AskingQuestions tool.
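What the question-asking side boils down to, as a sketch: the most relevant page texts are pasted into a prompt ahead of the question, up to a size budget. The function name `build_prompt`, the prompt wording and the character budget are assumptions, loosely modelled on the cookbook example, not the notebook's actual code.

```python
def build_prompt(question, sections, max_chars=2000):
    """Assemble a prompt from the best-ranked sections that fit in max_chars."""
    header = (
        "Answer the question using only the context below. "
        'If the answer is not in the context, say "I don\'t know".\n\nContext:\n'
    )
    chosen = []
    used = 0
    for section in sections:  # sections are assumed sorted, most relevant first
        if used + len(section) > max_chars:
            break
        chosen.append("* " + section)
        used += len(section)
    return header + "\n".join(chosen) + "\n\nQuestion: " + question + "\nAnswer:"
```

The resulting string is what would be sent to the completion model.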

+5mins: API seems to be unstable. Regularly backing up the embedding results.
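A sketch of that kind of safety net: retry failed calls with a growing delay, and write the results to a JSON file every few items so a crash does not lose the work. The `embed` callable stands in for the real API call, and every name and default here is hypothetical.

```python
import json
import time
from pathlib import Path


def embed_with_backup(texts, embed, backup_file, every=10, retries=3, wait=1.0):
    """Embed each text in `texts` (a dict of key -> text), resuming from backup_file."""
    path = Path(backup_file)
    done = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    for key, text in texts.items():
        if key in done:
            continue  # already embedded in a previous run
        for attempt in range(retries):
            try:
                done[key] = embed(text)
                break
            except Exception:
                if attempt == retries - 1:
                    # Out of retries: save what we have before giving up.
                    path.write_text(json.dumps(done), encoding="utf-8")
                    raise
                time.sleep(wait * 2 ** attempt)  # exponential backoff
        if len(done) % every == 0:
            path.write_text(json.dumps(done), encoding="utf-8")
    path.write_text(json.dumps(done), encoding="utf-8")
    return done
```

Calling it again with the same backup file skips everything that was already embedded.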

[16:30] All done. 15mins spare.

[16:46] What next ?

Streamlit app ?

https://medium.com/@avra42/build-your-own-chatbot-with-openai-gpt-3-and-streamlit-6f1330876846

[18:45] Streamlit app deployed

It lives there .

data
images
01.Data.ipynb
02.Embeddings.ipynb
03.AskingQuestions.ipynb
LICENSE
README.md
app.py
pages.xlsx

kelu124/Website_AMA
