Building a chan archiver

Starting an ambitious new project, in this case a chan scraper and archiver, requires a lot of thought. The first chan scraper I made was a simple script that scraped 4chan's /b/, but it blindly downloaded just the large images of every thread it saw and overwrote the previous files without considering whether it had downloaded them before.
What I actually wanted was a setup that would let me index any chan board I chose. Even with all of this planned out, it took a lot of changes to how the program was structured and executed, and the project would grow into thousands of lines of code.

The most important part of the process is the actual scraping and gathering of the media and text on the chan. The difficult thing about 4chan (in this case) is that the site's content is rendered into the HTML by JavaScript, which makes it a bit hard to scrape. I had to use a headless browser, in this case PhantomJS (http://phantomjs.org/), to render the JavaScript, and then parse the result with Python's Beautiful Soup library (https://www.crummy.com/software/BeautifulSoup/).
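A minimal sketch of that render-then-parse step, assuming PhantomJS is on the PATH and an older Selenium release that still ships the PhantomJS driver; the CSS class name is illustrative rather than verified:

```python
from selenium import webdriver
from bs4 import BeautifulSoup

# Headless browser renders the JavaScript into final HTML.
driver = webdriver.PhantomJS()
driver.get("https://boards.4chan.org/b/")
html_source = driver.page_source
driver.quit()

# Once rendered, the page is ordinary HTML that Beautiful Soup can walk.
soup = BeautifulSoup(html_source, "html.parser")
for thread in soup.find_all("div", class_="thread"):  # class name assumed
    print(thread.get("id"))
```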

The different boards on 4chan are posted to and updated at different rates. One can of course blindly scrape 4chan, but I designed a bot that scans the catalog and determines whether it has already seen each thread. If the thread is new, the bot visits it; if the thread exists and its rank has increased, the bot revisits it.
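A sketch of that check, assuming a hypothetical fetch_catalog() that returns (thread_id, rank) pairs parsed from the rendered catalog page, and a seen dict mapping thread ids to the rank recorded at the last visit:

```python
def threads_to_visit(catalog, seen):
    """Decide which threads the bot should (re)visit this pass."""
    to_visit = []
    for thread_id, rank in catalog:
        if thread_id not in seen:
            to_visit.append(thread_id)   # brand-new thread: visit it
        elif rank > seen[thread_id]:
            to_visit.append(thread_id)   # rank increased: revisit it
        seen[thread_id] = rank
    return to_visit
```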
/b/ moves the fastest, of course. I haven't yet gone through and measured the update rates of the other boards.

Once it is determined what needs scraping, the bot goes to the designated threads, and the posts are put into a database named {board name}_mod (/b/ would be b_mod), where mod is short for moderation. Depending on the board, the content may need to be looked over before it is considered safe to index.
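A sketch of what a per-board moderation table might look like, using SQLite for illustration; the column names here are hypothetical, not taken from the actual project:

```python
import sqlite3

conn = sqlite3.connect("chanarchiver.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS b_mod (
        post_id     INTEGER PRIMARY KEY,
        thread_id   INTEGER,
        post_text   TEXT,
        local_image TEXT,               -- path of the downloaded image on disk
        imgur_url   TEXT,               -- filled in later by the upload script
        approved    INTEGER DEFAULT 0   -- set after moderation
    )
""")
conn.commit()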
The database stores the post text and the local location of each image. If the images and posts are approved, a separate script uploads each local file to imgur, and the resulting URL is stored in the database.
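A sketch of that upload step using the imgur v3 API via requests; CLIENT_ID is a placeholder, and the row update assumes the hypothetical b_mod schema above:

```python
import requests

CLIENT_ID = "your-imgur-client-id"  # placeholder, not a real credential

def upload_to_imgur(image_path):
    with open(image_path, "rb") as f:
        resp = requests.post(
            "https://api.imgur.com/3/image",
            headers={"Authorization": "Client-ID " + CLIENT_ID},
            files={"image": f},
        )
    resp.raise_for_status()
    return resp.json()["data"]["link"]  # URL of the hosted image

def record_upload(conn, post_id, url):
    # Store the imgur URL back on the row the image came from.
    conn.execute("UPDATE b_mod SET imgur_url = ? WHERE post_id = ?",
                 (url, post_id))
    conn.commit()
```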

After all the scripts have run, every row in the {board name}_mod database is moved to {board name}_archive. After some time has passed, scripts render the rows of {board name}_archive into HTML and index files, which are then rsynced to a hosting server running nginx.
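A sketch of that final stage, assuming an archive table with the same hypothetical schema as the mod table; the output paths and rsync destination are placeholders:

```python
import html
import os
import subprocess

def archive_approved(conn, board="b"):
    # Move approved rows from the moderation table to the archive table.
    conn.execute("INSERT INTO {0}_archive SELECT * FROM {0}_mod "
                 "WHERE approved = 1".format(board))
    conn.execute("DELETE FROM {0}_mod WHERE approved = 1".format(board))
    conn.commit()

def render_and_publish(conn, board="b"):
    # Render archived rows into a static HTML page, then rsync it out.
    os.makedirs("out/" + board, exist_ok=True)
    rows = conn.execute(
        "SELECT post_id, post_text, imgur_url FROM {0}_archive".format(board))
    with open("out/{0}/index.html".format(board), "w") as out:
        for post_id, text, url in rows:
            out.write('<div id="p{0}"><img src="{1}"><p>{2}</p></div>\n'
                      .format(post_id, url, html.escape(text or "")))
    # Push the generated pages to the nginx host (placeholder destination).
    subprocess.run(["rsync", "-az", "out/{0}/".format(board),
                    "user@host:/var/www/archive/{0}/".format(board)])
```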

That’s the new chan archiver in a nutshell. It can be found here: https://github.com/andrewsyc/chanarchiver
