Web scraping with Python used for a TTS Reader life hack.
Status: currently in progress.
Extract text from a series of web pages to use in TTS Reader.
My eyes get tired. After I stare at the screen for many hours every day, they eventually give me a clear signal that they’ve had enough. With so many things I want to do, how can I give them the rest they need without losing precious time? Solution: use text-to-speech software to do some of the reading for me.
I discovered this life hack when I went back to college. I quickly realized that there simply weren’t enough hours in the day to read all the required materials while working full time. I explored various text-to-speech solutions, and TTS Reader works best by far.
Check it out here: https://ttsreader.com/
How TTS Reader works:
You copy the text you need, paste it into the box and press play. Unlike other readers out there, you don’t need to upload any files and you don’t need to download anything. You simply paste and go. There is no wait time because it doesn’t convert the entire text you pasted into speech up front. It converts one sentence at a time, which results in natural-sounding speech that is easy to listen to.
I like to visit this one online library. Each book is spread across many web pages. To have TTS Reader read it to me, I need to copy and paste the text from each individual page. This gets tiring very quickly.
What if I could grab the text from all of the pages of a book in one go and paste it into the reader so it could read me the entire book?
Are there solutions out there that already do that? Maybe, but what’s the fun in that when I have a chance to work a real-world problem into my learning plans?
What we need for this project:
- Separate text from other web page content.
- Extract that text and paste it into a placeholder.
- Move on to the next page and repeat the first two steps.
- Bonus: connect directly to TTS Reader so no manual copy+paste is needed.
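Here is a rough sketch of how the first three items might fit together in Python. It is only an outline under my own assumptions: the names collect_book_text, fetch_page and extract_text are made up for illustration, the pages are assumed to be UTF-8, and extract_text stands in for whatever Step 1 ends up producing. The bonus item is left out because I haven’t looked into how TTS Reader could be driven from a program.

```python
from urllib.request import urlopen


def fetch_page(url):
    """Download one page and return its HTML as a string (assumes UTF-8)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def collect_book_text(page_urls, extract_text, fetch=fetch_page):
    """Extract the text of every page and join it into one string.

    page_urls    -- list of page addresses, in reading order
    extract_text -- hypothetical function: HTML string in, plain text out
    fetch        -- function that downloads a page; replaceable for testing
    """
    placeholder = []  # the extracted text of each page accumulates here
    for url in page_urls:  # move on to the next page and repeat
        html = fetch(url)
        placeholder.append(extract_text(html))  # separate and store the text
    return "\n\n".join(placeholder)
```

Passing fetch as a parameter means the loop can be tried out on a few saved HTML strings before pointing it at a real site.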
We’ll be using Python to write the program and NetBeans to run it. You don’t need NetBeans to follow along; any way of running Python will do.
Step 1: Separate text from other web page content.
We need to know how each page is structured so that we know which HTML tags our program should look for.
Navigate to the page, right-click and pick Inspect. If you’re using Chrome, you’ll see that as you hover over different tags, different parts of the page are highlighted. Ignore the header and all script tags. Somewhere, within the body tag, there will be a set of tags which highlight the text of the page.
If more than just the text is highlighted, it means that you need to dig deeper. Click on the arrow to expand what’s inside that tag. There will be more tags. Repeat these steps until you find the innermost tag that still selects all of the text you need.
Doing this exercise on this very page, you’ll find that the text is inside the <div class="twp-article-wrapper clearfix"> tag.
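Once the tag is known, pulling the text out could look something like the sketch below. It uses only Python’s built-in html.parser module so there is nothing to install; a library such as BeautifulSoup is the more common choice for this kind of job. The class name ArticleTextExtractor and the function extract_article_text are my own inventions, and the sketch assumes the page’s div tags are properly closed.

```python
from html.parser import HTMLParser


class ArticleTextExtractor(HTMLParser):
    """Collect text found inside any <div> that carries a given class."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.div_depth = 0   # 0 means we are outside the target div
        self.skip_depth = 0  # above 0 means we are inside <script>/<style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.div_depth > 0:
            if tag == "div":
                self.div_depth += 1  # nested div, still part of the article
            elif tag in self.SKIPPED_TAGS:
                self.skip_depth += 1
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.target_class in classes:
                self.div_depth = 1  # entered the target div

    def handle_endtag(self, tag):
        if self.div_depth == 0:
            return
        if tag == "div":
            self.div_depth -= 1
        elif tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits inside the target div.
        if self.div_depth > 0 and self.skip_depth == 0:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract_article_text(html, target_class="twp-article-wrapper"):
    """Return the text inside the div with target_class, one chunk per line."""
    parser = ArticleTextExtractor(target_class)
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Counting nested div tags is what lets the parser tell the end of the article wrapper apart from the end of a div inside it.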
To be continued…