Scraping Youtube Video Playlist

Youtube is not just a website for cute animal videos. I have found it a great place to learn programming (and other interesting skills), especially for non-professional coders like myself. I have been trying to learn more about Python by following a number of great channels such as sentdex, noobtoprofessional and CodeGeek. As my “Youtubeducation” progresses, I need to track what I have been watching alongside my other learning activities, so I wanted to scrape the links and titles of the videos for my notes. Taking this as an opportunity to practice what I have learned, I started coding a Python class to scrape any Youtube playlist when given its URL.

Iteration #1:

Without inspecting the HTML of a playlist webpage, I started coding. In hindsight, this was so WRONG in many ways. Anyway, you only learn by practicing and making mistakes.

On every playlist page (take this CodeGeek playlist for example), the only video links are those of the videos on the list. Each video is linked twice, once through its thumbnail and once via its title text, except the first video, which has 2 more links at the top for the Play All option. The link to a video looks like this:

or

So I came up with the idea of extracting all links that begin with /watch?v= and turning them into a Python set() to remove the duplicates, which would leave me with the URLs of all the videos. Here is what the code looked like:

Once this rawList is generated, the indices taken from the index= parameter in the URLs are used to sort the final list back into playlist order.

Of course there was some parsing going on using the custom function urlSplit() in the middle, but I won’t bore you with that.
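For the curious, the dedupe-and-sort idea boils down to something like the sketch below. It is not my original listing: the names playlist_index and unique_sorted_links and the sample links are made up for illustration, and the standard library’s urllib.parse stands in for the custom urlSplit() helper.

```python
from urllib.parse import parse_qs, urlsplit


def playlist_index(href):
    """Return the numeric index= query parameter of a link.

    Falling back to 0 when the parameter is missing is an assumption
    made for this sketch.
    """
    query = parse_qs(urlsplit(href).query)
    return int(query.get("index", ["0"])[0])


def unique_sorted_links(hrefs):
    """Keep /watch?v= links, drop duplicates, and sort by playlist index."""
    # The set comprehension removes duplicates but loses the page order...
    watch_links = {h for h in hrefs if h.startswith("/watch?v=")}
    # ...so the order has to be rebuilt from the index= parameter.
    return sorted(watch_links, key=playlist_index)


# Made-up links: each video appears twice (thumbnail and title).
hrefs = [
    "/watch?v=ccc&index=10",
    "/watch?v=aaa&index=1",
    "/watch?v=aaa&index=1",
    "/watch?v=bbb&index=2",
    "/watch?v=bbb&index=2",
    "/watch?v=ccc&index=10",
    "/about",
]
print(unique_sorted_links(hrefs))
```

Converting the index to an int matters here: sorted as plain strings, "10" would land before "2".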

This method kind of worked: given a playlist URL, it spat out all the links to its videos. But it didn’t feel right. By using set() to remove duplicates, I lost the order of the videos and had to parse their indices to sort them again, which seemed counterintuitive to me.

Iteration #2:

I finally realized that I must check the HTML code to find out what’s really behind these links. As I mentioned above – I really shouldn’t have started any coding without doing so!

Inspecting HTML elements with the Chrome browser is very handy. Right-click the element, in this case the link to a video, and select Inspect (or press Ctrl + Shift + I) to bring up the page inspection console with the entire <a> tag of the link highlighted.

Oh boy! Why didn’t I do this sooner? It is now obvious that the video thumbnail and title links, although they point to exactly the same URL, sit in <a> tags of different classes. Forget about all the duplicate removal and indices: I can simply use Beautiful Soup to extract only the links that belong to the pl-video-title-link class. How beautiful is that? Here comes the second iteration of my code:
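As a rough sketch of the idea (not the original listing), selecting links by class with Beautiful Soup might look like this. The sample HTML is invented for the example, the thumbnail class name is a placeholder, and title_links is a hypothetical function name; only pl-video-title-link comes from the real page.

```python
from bs4 import BeautifulSoup

# Invented markup: each video has a thumbnail link and a title link.
# "thumbnail-link-placeholder" is a made-up class name for this sketch.
SAMPLE_HTML = """
<table>
  <tr>
    <td><a class="thumbnail-link-placeholder" href="/watch?v=aaa">thumb</a></td>
    <td><a class="pl-video-title-link" href="/watch?v=aaa"> First video </a></td>
  </tr>
  <tr>
    <td><a class="thumbnail-link-placeholder" href="/watch?v=bbb">thumb</a></td>
    <td><a class="pl-video-title-link" href="/watch?v=bbb"> Second video </a></td>
  </tr>
</table>
"""


def title_links(html):
    """Return the href of every title link, in page order."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ (with the trailing underscore) filters on the CSS class;
    # find_all returns matches in the order they appear in the document.
    anchors = soup.find_all("a", class_="pl-video-title-link")
    return [a["href"] for a in anchors]


print(title_links(SAMPLE_HTML))
```

Since each video has exactly one title link, there is nothing to deduplicate and nothing to sort.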

Only the last line was different, but it’s simpler and more elegant. Because Beautiful Soup parses the links from top to bottom, all of the URLs in the list are already in the correct order and there is no need to fiddle with them. In the end I put the code into a class for ease of use and added an input URL validation check. Here is the full code:
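To give a flavour of the validation part, here is a minimal sketch of how a playlist URL might be checked before scraping. It is an illustration rather than the class itself: is_playlist_url is a hypothetical name, and the rules (an http or https scheme, a youtube.com host, the /playlist path and a non-empty list= parameter) are my assumptions about what a reasonable check could cover.

```python
from urllib.parse import parse_qs, urlsplit


def is_playlist_url(url):
    """Return True if url looks like a Youtube playlist page.

    Hypothetical check for illustration; the accepted hosts and path
    are assumptions, not an exhaustive list.
    """
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    # hostname is already lower-cased and excludes any port number.
    if parts.hostname not in ("youtube.com", "www.youtube.com"):
        return False
    # parse_qs drops blank values, so "list=" alone does not count.
    return parts.path == "/playlist" and "list" in parse_qs(parts.query)


print(is_playlist_url("https://www.youtube.com/playlist?list=PLabc"))
print(is_playlist_url("https://www.youtube.com/watch?v=abc"))
```

Checking the URL up front means a typo fails with a clear answer instead of a confusing empty result later on.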

As you may have spotted in the code, in addition to the URLs, I have also included the titles of the videos in the playlist. It makes sense, doesn’t it? The code is packaged and can be downloaded from GitHub here; it lives in the code folder, alongside a main.py file showing an example of how to use the class.