Building a chan archiver

Starting off a new ambitious project, in this case a chan scraper and archiver, requires a lot of thought. The first chan scraper I made was a simple script that scraped 4chan's /b/, but it blindly downloaded just the large images of every thread it saw and overwrote the previous files without considering whether it had downloaded them before.
I actually wanted to create a setup that would let me index every chan board I cared about. Though I had all of this planned out, it would take a lot of changes to how the program was structured and executed. The project would ultimately grow into thousands of lines of code.

The most important part of the process is building the actual scraping and gathering of the media and text on the chan. The difficult thing about 4chan (in this case) is that the site is JavaScript that gets rendered into HTML, which made it a bit harder to scrape. I had to use a headless browser instance to render the JavaScript and then parse the result with Python's Beautiful Soup library.
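In practice that part of the pipeline looks roughly like this (a minimal sketch, not the actual bot; the board URL and the post class are placeholders):

import dryscrape
from bs4 import BeautifulSoup

# Let a headless WebKit session execute the page's JavaScript
session = dryscrape.Session(base_url="https://boards.4chan.org")
session.visit("/b/")

# Hand the rendered HTML to Beautiful Soup for the actual parsing
soup = BeautifulSoup(session.body(), "html.parser")
for post in soup.find_all("div", attrs={"class": "post"}):
    print post.get_text()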

The different boards on 4chan are posted to and updated at different rates. One can of course blindly scrape 4chan, but I designed a bot that would scan the catalog and determine whether a thread already existed; if it did, the bot would check whether the thread's rank had increased. If the rank had increased, the bot would revisit the thread, and if it was a new thread it would of course visit it for the first time.
/b/ moves the fastest, of course. At this time I haven't gone through and determined how quickly the other boards move.
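The revisit logic boils down to something like this (a hedged sketch; the function and variable names are hypothetical, not taken from the real bot):

# catalog is a list of (thread_no, rank) pairs scraped from the board catalog;
# seen_threads maps a thread number to the rank it had on the previous pass.
def threads_to_visit(catalog, seen_threads):
    to_visit = []
    for thread_no, rank in catalog:
        if thread_no not in seen_threads:
            # brand new thread: always visit it
            to_visit.append(thread_no)
        elif rank > seen_threads[thread_no]:
            # the thread's rank increased, so there is new activity worth revisiting
            to_visit.append(thread_no)
        seen_threads[thread_no] = rank
    return to_visit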

Once it is determined what needs scraping, the bot visits the designated threads, and the threads are put into a database table called {board name}_mod (/b/ would be b_mod), where mod is short for moderation. Depending on the board, the content may need to be looked over before it is considered safe to index.
The table stores the post text and the local path where each image is stored. If the images and posts are approved, a separate script uploads each local file to Imgur, and the resulting URL is stored in the database during the upload.
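The upload step looks roughly like this (a sketch only: the client ID and the b_mod column names are placeholders, and the requests call against Imgur's v3 image endpoint is my assumption about how you would wire it up):

import base64
import requests
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/database_name')
IMGUR_CLIENT_ID = "your-client-id"

def upload_image(row_id, image_path):
    # Read the locally stored image and upload it to Imgur
    with open(image_path, "rb") as f:
        payload = {"image": base64.b64encode(f.read())}
    resp = requests.post("https://api.imgur.com/3/image",
                         headers={"Authorization": "Client-ID " + IMGUR_CLIENT_ID},
                         data=payload)
    link = resp.json()["data"]["link"]
    # Store the returned Imgur URL next to the row that holds the post text
    engine.execute("UPDATE b_mod SET imgur_url = %s WHERE id = %s", (link, row_id))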

After all the scripts have run, every row in {board name}_mod is moved to {board name}_archive. After some time has passed, other scripts render the rows of {board name}_archive into HTML and index files, which are then rsynced to a hosting server running nginx.
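The hand-off from moderation to archive, plus the final sync, amounts to something like this (a sketch; the local path, user, and hostname are placeholders):

import subprocess
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:password@localhost/database_name')

# Move the approved rows out of the moderation table and into the archive table
engine.execute("INSERT INTO b_archive SELECT * FROM b_mod")
engine.execute("DELETE FROM b_mod")

# Push the rendered HTML and index files to the nginx host
subprocess.call(["rsync", "-az", "/var/www/archive/", "user@example.com:/var/www/html/"])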

That’s the new chan archiver in a nutshell. It can be found here:

Rewriting a script to work with a hosting provider

There are all kinds of ways and reasons to rewrite a script or package of scripts you wrote, but would you do it for the hosting provider you selected? Would I be right in saying no? That would be the wise thing to do. I spent $42 for a month of hosting with soyoustart, a subsidiary of OVH.

Caveat: the reason for using a library like dryscrape is that I needed to render the JavaScript before scraping; the page needed a JavaScript interpreter to produce the HTML I could then parse. Otherwise the argument to just make a simple bot with the standard Python libraries is valid.

I recently wrote a scraper script that used a Python library called dryscrape. The script was ready to go: I had tested it locally, on a Raspberry Pi, and on a DigitalOcean instance.
I got a server with soyoustart because I wanted 2 TB of disk space, as I was scraping a lot of digital media. Getting the environment up and running was easy enough, but upon trying to install dryscrape I got:

g++: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
Makefile.webkit_server:1006: recipe for target 'build/Version.o' failed
make[1]: *** [build/Version.o] Error 4
make[1]: Leaving directory '/tmp/pip-build-jI5qGh/webkit-server/src'
Makefile:38: recipe for target 'sub-src-webkit_server-pro-make_first-ordered' failed
make: *** [sub-src-webkit_server-pro-make_first-ordered] Error 2
error: [Errno 2] No such file or directory: 'src/webkit_server'

My solution: try different distros of Linux, and different versions of different distros. None of that worked. Next I tried every alternative way of installing dryscrape that was offered, including brew (which normally runs on a Mac) used on Linux. Again and again, error after compile error, all the same thing. I tried different versions of gcc/g++ and different sources.list entries to see if that would help. Argh!

I next tried Selenium with Firefox (Iceweasel), faking a display so it would think it was running in a window, but I still had issues.

Finally I decided to use an implementation of PhantomJS with Python. I ended up using a new Python environment, Anaconda; after running its installation script I found a way to get PhantomJS working with it. All I needed to do was rewrite a few things to work with PhantomJS.
While proxy support is not as robust, everything has worked well.
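The rewrite mostly amounted to swapping the rendering layer, roughly like this (a sketch, not the exact code; the URL and proxy settings are placeholders):

from selenium import webdriver
from bs4 import BeautifulSoup

# PhantomJS renders the JavaScript much like dryscrape's WebKit server did;
# proxy settings are passed through service_args rather than a per-request API.
driver = webdriver.PhantomJS(service_args=["--proxy=127.0.0.1:8080", "--proxy-type=http"])
driver.get("https://boards.4chan.org/b/")

# Once the page has rendered, the parsing side stays exactly the same
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()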

Summary: I learned a lot, of course (-: Never give up!
Actually, these exercises are in my opinion never a waste. I learned so much in the process of trying all the different alternatives. This was just a compile error that I think traces back to OVH's implementation of the operating system they have you install on the server you order. It's a rare occurrence, and overall I think OVH is a hell of a bang for your buck in hosting.

Bots are useful, even in your personal life.

Bots are everywhere and they aren't such a bad thing. Sure, people write them to be spammy and to clog up parts of the web, and webmasters tend to hate bots because spammers use them to invade their communities and wreak havoc at times.

Bots are great though; they are the backbone of companies like Google. They are what allow Google to keep its information about websites around the web up to date, so it can determine the relevant results to feed to your browser when you do a search.

One example of a bot I wrote was to notify me every time a new list of books came out on a specific website. Granted, this site hosted copyright-infringing books, but I still thought it was a neat thing to try out. I should disclaim that I am a member of , and this experiment was essentially just for entertainment value.

The email-notifier bot code for the now defunct  is below.


from sqlalchemy import create_engine

import hashlib
import urllib
import smtplib

import pymysql
from bs4 import BeautifulSoup

# Fetch the home page of the book site (the URL is omitted in the original post)
response = urllib.urlopen("")

m = hashlib.md5()

# Parse the page and pull out the section of the listing we care about
soup = BeautifulSoup(response.read(), "html.parser")
tag = soup.find_all("td", attrs={"class": "top"})
print str(tag)

# Hash the parsed section so it can be compared against the previous run
m.update(str(tag))

engine = create_engine('mysql+pymysql://user:password@localhost/database_name')

# Grab the id of the most recent row so we can compare hashes
check_duplicate = engine.execute("SELECT MAX(id) FROM %s " % 'itebooks')

for i in check_duplicate:
    print i[0]
    row = engine.execute("SELECT * FROM {} WHERE id = {}".format('itebooks', str(i[0])))

    for r in row:
        print r[1]
        if m.hexdigest() != r[1]:
            # The page changed: store the new hash and send a notification email
            print "time to update"
            engine.execute("INSERT INTO {} ({}) VALUES ('{}')".format('itebooks', 'checker', m.hexdigest()))
            msg = 'it-ebooks has updated.'
            server = smtplib.SMTP('', 587)  # port 465 or 587; the host, login and sendmail call are omitted in the original
        else:
            print "no need to update"

This script ran via a cron job which I set to run every 8 hours. The logic was simple: the bot fetched the home page, parsed the file down to a certain section, and fed that section into a hashing algorithm. The resulting hash was then compared to the previous hash in the database. If the hash was the same, nothing had changed; if it had changed, the script would send an email to a specific address notifying me that new books were on the site. Kinda fun actually. (:
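The crontab entry for an eight-hour interval looks something like this (the script path is just a placeholder):

# m h dom mon dow command
0 */8 * * * /usr/bin/python /home/user/book_notifier.py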

Bots are simply programs that run in the background all the time, go out and either put or retrieve information, and then do something with it. In this case I built a bot that checked whether part of a web page had changed, compared that against a database, and emailed me if there was a change.

Carefully engineered bots are great for breaking problems down: they can do in seconds or minutes what would take a human days or even years. Great for automating work and things like dating (-;

Viewing levels from the original Tomb Raider games

Growing up, one of my favorite games to play was Tomb Raider; most of the series, in fact, is enjoyable. For some odd reason I decided to do a little research on the development of the game after listening to the developer commentary in the remake, Tomb Raider Anniversary. Here is one video example of the developers discussing the design of Palace Midas:

Tomb Raider 1 was made in the same era as Doom, a few years afterward, and for its time had beautiful graphics and a 3D rendering engine. The levels were breathtaking, with music cues that triggered an emotional effect. Nonetheless, I felt compelled to explore anything people may have found out about it.

Forums are everywhere on the internet and full of enthusiasts; it always amazes me that there are members with tens of thousands of posts on a niche forum.

On the forums and around the web I found a program called Tomb Viewer, an engine that lets you load the Tomb Raider level files and move through them with a first-person camera. If you ever doubted a missing secret or part of a stage, this is your opportunity to search every nook and cranny of the level.

It's always fun to get an old program up and running. What made it gratifying was flying through walls, looking at the underside of specific rooms, and even seeing the bulges in the level geometry where certain secrets lay.

I managed to pull the level files out of an original Tomb Raider demo I found online, along with the levels for Tomb Raider: Unfinished Business and Tomb Raider 2. I had to find the PC versions, as the PlayStation versions used .PSX file formats which I could not read.

Enough of the program was laid out that I contacted the author.

He even gave me the source code to look at. Despite the difficulty of getting it to compile, I was still grateful and thought what he had done was really neat.

It's gratifying to authors if you read their work or tutorials, try it out, get the code to work, and then email them upon success. This is a great way to build up your network with people who've done some really cool things.