Rewriting a script to work with a hosting provider

There are all kinds of reasons to rewrite a script or package of scripts you wrote, but would you do it just to accommodate the hosting provider you selected? Probably not, and that is usually the wise call. Still, that is exactly what I ended up doing after spending $42 for a month of hosting with soyoustart.com, a subsidiary of ovh.com.

Caveat: the reason for using a library like dryscrape is that I needed to render JavaScript before scraping; the page required an interpreter to produce the HTML, which I could then scrape. Otherwise, the argument for just building a simple bot with standard Python libraries is perfectly valid.
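For context, a dryscrape fetch only takes a few lines. Here is a minimal sketch of the idea (the URL is a placeholder, not the site I was actually scraping):

import dryscrape
from bs4 import BeautifulSoup

# headless WebKit session that executes the page's JavaScript before we scrape
session = dryscrape.Session()
session.set_attribute('auto_load_images', False)  # skip images for speed
session.visit('http://example.com/listing')        # placeholder URL

# once the JavaScript has run, hand the rendered HTML to BeautifulSoup
soup = BeautifulSoup(session.body(), 'lxml')
for link in soup.find_all('a'):
    print(link.get('href'))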

I recently wrote a scraper script that utilized a Python library called dryscrape. The script was ready to go: I had tested it locally, on a Raspberry Pi, and on a DigitalOcean instance.
I got a server with soyoustart because I wanted 2 TB of disk space, as I was scraping a lot of digital media. Getting the environment up and running was easy enough, but when I tried to install dryscrape I got:

g++: internal compiler error: Segmentation fault (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See  for instructions.
Makefile.webkit_server:1006: recipe for target 'build/Version.o' failed
make[1]: *** [build/Version.o] Error 4
make[1]: Leaving directory '/tmp/pip-build-jI5qGh/webkit-server/src'
Makefile:38: recipe for target 'sub-src-webkit_server-pro-make_first-ordered' failed
make: *** [sub-src-webkit_server-pro-make_first-ordered] Error 2
error: [Errno 2] No such file or directory: 'src/webkit_server'

My first approach: try different Linux distros, and different versions of those distros. None of that worked. Next I tried every alternative way of installing dryscrape that was offered, including brew (which normally runs on a Mac) ported to Linux. Again and again, the same compile error. I tried different versions of gcc/g++ and different sources.list entries to see if that would help. Argh!

I next tried Selenium, faking Iceweasel (Debian's rebranded Firefox) into thinking it was running in a window, but still had issues.
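The usual trick for convincing a browser it has a window on a headless server is a virtual X display (Xvfb), for example through pyvirtualdisplay. A rough sketch of that approach, not my exact code, looks like this:

from pyvirtualdisplay import Display
from selenium import webdriver

# start a virtual X display so the browser believes it has a real window
display = Display(visible=0, size=(1024, 768))
display.start()

driver = webdriver.Firefox()       # assumes Iceweasel/Firefox is the system browser
driver.get('http://example.com')   # placeholder URL
print(driver.page_source[:200])

driver.quit()
display.stop()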

Finally I decided to use an implementation of PhantomJS with Python. I ended up using a new Python environment, Anaconda; after running its installation script I found a way to get PhantomJS working with it. All I needed to do was rewrite a few things to work with PhantomJS.
While proxy support is not as robust, everything has worked well.
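For anyone curious, driving PhantomJS from Python typically goes through Selenium's PhantomJS driver (since deprecated). A rough sketch of that setup, including the proxy flags I found less flexible, might look like this; the proxy address and URL are placeholders:

from selenium import webdriver

# PhantomJS takes its proxy settings as command-line arguments
service_args = [
    '--proxy=127.0.0.1:8080',   # placeholder proxy
    '--proxy-type=http',
]

driver = webdriver.PhantomJS(service_args=service_args)  # assumes phantomjs is on the PATH
driver.get('http://example.com')   # placeholder URL
html = driver.page_source          # fully rendered HTML, ready for BeautifulSoup
driver.quit()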

Summary: I learned a lot, of course (-: Never give up!
Exercises like these, in my opinion, are never a waste; I learned so much in the process of trying all the different alternatives. The compile error itself, I suspect, traces back to OVH's (soyoustart.com's) build of the operating system image they have you install on the server you order. This is a rare occurrence, and overall I think OVH is a hell of a bang for your buck in hosting.

Bots are useful, even in your personal life.

Bots are everywhere, and they aren't such a bad thing. Sure, people write them to be spammy and clog up parts of the web, and webmasters tend to hate bots because spammers use them to invade their communities and wreak havoc at times.

Bots are great, though. They are the backbone of companies like Google: they are what allow Google to keep its information about websites up to date so it can determine the relevant results to feed to your browser when you run a search.

One example of a bot I wrote was one to notify me every time a new list of books came out on a specific website. Granted, this site hosted copyright-infringing books, but I still thought it was a neat thing to try out. I should add the disclaimer that I am a member of Safaribooksonline.com, which essentially made this experiment purely for entertainment value.

The emailer bot code for the book notifier for the now-defunct it-ebooks.info is below.

#!/usr/bin/python
# Book notifier bot: hash part of the it-ebooks.info front page and email me
# when the hash changes (i.e. when new books have been listed).

import hashlib
import smtplib
import urllib

from bs4 import BeautifulSoup
from sqlalchemy import create_engine
import pymysql  # used by SQLAlchemy via the mysql+pymysql:// URL

# Fetch the front page and grab the section that lists the newest books
response = urllib.urlopen("http://it-ebooks.info/")
soup = BeautifulSoup(response.read())
tag = soup.find_all("td", attrs={"class": "top"})
print str(tag)

m = hashlib.md5()
m.update(str(tag))

engine = create_engine('mysql+pymysql://user:password@localhost/database_name')

# Grab the id of the most recent hash stored in the database
check_duplicate = engine.execute("SELECT MAX(id) FROM %s " % 'itebooks')

for i in check_duplicate:
    print i[0]
    row = engine.execute("SELECT * FROM {} WHERE id = {}".format('itebooks', str(i[0])))

    for i in row:
        print i[1]
        if m.hexdigest() != i[1]:
            # The page changed: store the new hash and send myself an email
            print "time to update"
            engine.execute("INSERT INTO {} ({}) VALUES ('{}')".format('itebooks', 'checker', m.hexdigest()))
            msg = 'it-ebooks has updated.'

            server = smtplib.SMTP('smtp.gmail.com', 587)  # port 465 or 587
            server.ehlo()
            server.starttls()
            server.ehlo()
            server.login('from@gmail.com', 'password')
            server.sendmail('from@gmail.com', 'to@gmail.com', msg)
            server.close()

        else:
            print "no need to update"

This script ran via a cron job set to run every 8 hours. The logic was simple: the bot would fetch the home page, parse out a particular section of it, and feed that section into a hashing algorithm. The resulting hash was then compared to the previous hash in the database. If the hash was the same, nothing had changed; if it had changed, the script sent an email to a specific address notifying me that new books were on the site. Kinda fun actually. (:
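The cron entry itself was a single line. Something like this (the script path and log file are placeholders for wherever the bot actually lived):

# run the book notifier every 8 hours, on the hour
0 */8 * * * /usr/bin/python /home/user/book_notifier.py >> /var/log/book_notifier.log 2>&1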

Bots are simply programs that run in the background all the time, go out and either post or retrieve information, and then do something with it. In this case I built a bot that checked whether part of a web page had changed, compared that against a database, and emailed me if there was a change.

Carefully engineered bots can compress work that would take a human days or even years into mere seconds or minutes. Great for automating work and things like dating (-;

Viewing levels from the original Tomb Raider games

Growing up, one of my favorite games to play was Tomb Raider; most of the series, in fact, is enjoyable. For some odd reason I decided to do a little research on the development of the game after listening to the developer commentary in the remake, Tomb Raider Anniversary. One good video example has the developers discussing the design of the Palace Midas level.

Tomb Raider 1 was made in the same era as Doom, a few years afterward, and for its time had beautiful graphics and a 3D rendering engine. The levels were breathtaking back then, with music triggered to set an emotional mood. Nonetheless, I felt compelled to explore anything that people may have dug up about it.

Forums are everywhere on the internet and full of enthusiasts; it always amazes me that there are members with tens of thousands of posts on a niche forum.

On tombraiderforums.com and around the web I found a program called Tomb Viewer, an engine that loads the Tomb Raider level files and lets you move through them with a first-person camera. If you ever doubted a missed secret or part of a stage, this is your opportunity to search every nook and cranny in the level.

It’s always fun to get an old program up and running; what made it gratifying was flying through walls, looking at the undersides of specific rooms, and even seeing bulges in the level geometry where certain secrets lay.

I managed to pull the level files out of an original Tomb Raider demo I found online, along with the levels for Tomb Raider: Unfinished Business and Tomb Raider 2. I had to find the PC versions, as the PlayStation versions used .PSX file formats which I could not read.

Enough of the program was laid out that I contacted the author: http://www.geocities.ws/jimmyvalavanis/applications/tombviewer.html

http://www.tombraiderforums.com/showthread.php?t=146951&page=2

https://github.com/andrewsyc/Tomb-Raider-1-2-3-4-Map-viewer-and-levels

He even gave me the source code to look at. Despite the difficulty of getting it to compile, I was still grateful and thought what he had done was really neat.

It’s gratifying to authors if you read their work or tutorials, try them out, get the code to work, and then email them upon success. This is a great way to build up your network with people who’ve done some really cool things.

The mining of a webmaster forum

There are webmaster forums for just about every kind of online industry, and offline as well. Many online mediums have experienced growth and popularity, only to ebb as newer platforms overtook them.

One such popular online industry is adult, or simply stated: porn. Adult content has gone from a relatively scarce resource to one that is overabundant, where even minors readily have access to just about anything they want.

It was not always this way, and that can be inferred by analyzing one of the most popular adult webmaster forums on the web. GFY.com (gofuckyourself.com) is one of the fastest-moving webmaster forums for adult program webmasters. Cumulatively, there have been over 20 million posts on GFY over the course of 15 years, though things on the forum have really slowed down over time.

The adult industry has gone from a crowd of individual webmasters and merged into large conglomerates like Mindgeek, Gammae, Kink, Bangbros, and other entities along the lines of cams and dating.

Plagued with trolls and even very offensive behavior, GFY is nonetheless one of the best public troves for inferring the industry's activity over the years. So out of curiosity I did just that: I scraped, parsed, and ran some queries to see what the forum data showed. Below is how I did it.

The first step was scraping the text of every thread, from the inception of the board to the present day. I simply used a PHP script utilizing cURL inside a for loop; simple concept, code below.

<?php
/**
 * GFY thread scraper
 * Created by JetBrains PhpStorm.
 * Date: 5/16/14
 * Time: 6:43 PM
 * author: andrewsyc
 */

$GFY_THREAD = "http://gfy.com/showthread.php?t=";
$thread_ID = 1;

// each scraper instance is set to work through 300,000 threads
$thread_max = $thread_ID + 300000;
$DIRECTORY = "/home/GFY/threads/";

for ($thread_ID = 1; $thread_ID < $thread_max; $thread_ID++) {

    // fetch the first page of the thread straight into a text file
    $fp = fopen($DIRECTORY . "/" . $thread_ID . ".txt", "w");

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($curl, CURLOPT_FILE, $fp);   // write the response directly to the file

    curl_exec($curl);
    curl_close($curl);
    fclose($fp);

    /**
     * Pagination loop for threads with more than 49 posts
     */
    $contents = file_get_contents($DIRECTORY . "/" . $thread_ID . ".txt");
    echo "\n Thread " . $thread_ID;

    if (preg_match('/\btitle="Next Page\b/i', $contents)) {
        echo "it did match"; // thread has more than 49 posts, so walk its pages

        for ($page_value = 2; $page_value < 200; $page_value++) {

            $fp = fopen($DIRECTORY . "/" . $thread_ID . "_" . $page_value . ".txt", "w");
            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID . "&page=" . $page_value);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);
            curl_setopt($curl, CURLOPT_FILE, $fp);

            curl_exec($curl);
            curl_close($curl);
            fclose($fp);

            // stop paginating once there is no "Next Page" link
            $contents = file_get_contents($DIRECTORY . "/" . $thread_ID . "_" . $page_value . ".txt");
            if (!preg_match('/\btitle="Next Page\b/i', $contents)) {
                break;
            }
        }
    }

    usleep(500); // brief pause between threads (usleep takes microseconds)
}
?>

After scraping, I had over 200 GB of text. I zipped it up and downloaded it to my local machine; uncompressed text really does shrink well when compressed, going from over 200 GB to less than 20 GB.

Next I had to create a parsing script, which I wrote in Python using BeautifulSoup4, SQLAlchemy, and MySQL to parse the files and store the results in a database for later queries.

Writing the parser took a lot of trial and error, and the actual run of the script took over 6 days, cumulatively, to work through more than 1.2 million text files. On average about 50 post entries were parsed every second; at that rate, 20 million posts works out to roughly 400,000 seconds of pure parsing, or about four and a half days, before any overhead. 20 million posts takes a while.

from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table, Column, Integer, String
from sqlalchemy import insert
from bs4 import BeautifulSoup
import os
import re
import time

# Time the script start
start = time.time()
directory = '/path/to/textfiles'

# file number tracker
i = 1

for file in os.listdir(directory):
    print i
    i = i + 1
    if file.endswith('.txt'):
        # reset the fields parsed out of each post
        threadID = year = month = day = hour = minute = join_month = join_year = post_in_thread = post_number = 0
        user_name = AMorPM = status = location = message = ""

        f = open(directory + '/' + file, 'r+')
        threadID = file.split('.')[0]

        soup = BeautifulSoup(f.read(), 'lxml')
        engine = create_engine('mysql+pymysql://user:pass'
                               '@localhost/GFY_2016')

        post_in_thread = 0
        thread_title = ""
        # every post in the saved page lives in a <table id="post..."> element
        posts = soup.find_all('table', attrs={'id': re.compile('post')})
        for p in posts:

            items = BeautifulSoup(str(p), 'lxml')
            date = items.find('td', attrs={'class': 'thead'})
            date_string = BeautifulSoup(str(date)).get_text().strip()
            parsed_date = date_string.split('-')

            try:
                # Gets the month, day, year from the extracted text
                month = parsed_date[0]
                day = parsed_date[1]

                parsed_date = parsed_date[2].split(',')
                year = parsed_date[0]

                post_time = parsed_date[1].split(':')
                hour = post_time[0]
                minute = post_time[1].split(' ')[0]
                AMorPM = post_time[1].split(' ')[1]
            except:
                pass

            try:
                # Position of the post within the thread
                post_number = items.find('a', attrs={'target': 'new'})
                test = BeautifulSoup(str(post_number))
                post_in_thread = test.get_text()

                # Get the username of the individual
                user_name = items.find('a', attrs={'class': 'bigusername'})
                name = BeautifulSoup(str(user_name)).get_text()
                user_name = name
            except:
                pass

            try:
                # Get the status of the user, e.g. confirmed or so fucking banned
                status = items.find('div', attrs={'class': 'smallfont'})
                status = BeautifulSoup(str(status)).get_text()

                # Join date
                join_date = items.find(string=re.compile("Join Date:"))
                join_date = BeautifulSoup(str(join_date)).get_text()
                join_month = join_date.split(' ')[2]
                join_year = join_date.split(' ')[3]
            except:
                pass

            # Location
            try:
                location = items.find(string=re.compile("Location:"))
                location = BeautifulSoup(str(location)).get_text()
            except:
                pass

            # Total post count listed in the user's profile block
            try:
                posts_text = items.find(string=re.compile("Posts:"))
                posts_text = BeautifulSoup(str(posts_text)).get_text().strip()
                post_number = posts_text.split(' ')[1].replace(',', '')
            except:
                pass

            # The body of the post itself
            try:
                message = BeautifulSoup(str(items.find_all(id=re.compile('post_message')))).get_text()
                message = message.replace('\\n', '').replace(']', '').replace('[', '').replace('\\r', '')
            except:
                pass

            # This code creates a new thread entry if the post is determined to be the first one
            if test.get_text() == '1':

                try:
                    # Pull the thread title out of the first post's header block
                    title_block = items.find('td', attrs={'class': 'alt1'})
                    thread_title = BeautifulSoup(str(title_block)).find('div', attrs={'class': 'smallfont'})
                    thread_title = re.search('(?<=>)(.*?)(?=<)', str(thread_title))

                    metadata = MetaData()
                    thread = Table('threads', metadata,
                                   Column('threadID', String),
                                   Column('title', String),
                                   Column('username', String),
                                   Column('year', Integer),
                                   Column('month', Integer),
                                   Column('day', Integer),
                                   Column('hour', Integer),
                                   Column('minute', Integer),
                                   Column('AMorPM', String)
                                   )
                    metadata.create_all(engine)

                    # Make sure to add items here that were parsed
                    ins = insert(thread).values(
                        threadID=threadID,
                        title=thread_title.group(0),
                        username=user_name,
                        year=year,
                        month=month,
                        day=day,
                        hour=hour,
                        minute=minute,
                        AMorPM=AMorPM
                    )

                    # insert the parsed thread into the database
                    engine.execute(ins)
                except:
                    pass

            try:
                # Every post, first or not, gets a row in the posts table
                metadata = MetaData()
                posts_table = Table('posts', metadata,
                                    Column('threadID', String),
                                    Column('username', String),
                                    Column('year', Integer),
                                    Column('month', Integer),
                                    Column('day', Integer),
                                    Column('hour', Integer),
                                    Column('minute', Integer),
                                    Column('AMorPM', String),
                                    Column('join_year', Integer),
                                    Column('join_month', String),
                                    Column('post_in_thread', Integer),
                                    Column('postcount', Integer),
                                    Column('message', String)
                                    )
                metadata.create_all(engine)

                # Make sure to add items here that were parsed
                ins = insert(posts_table).values(
                    threadID=threadID,
                    username=user_name,
                    year=year,
                    month=month,
                    day=day,
                    hour=hour,
                    minute=minute,
                    AMorPM=AMorPM,
                    join_year=join_year,
                    join_month=join_month,
                    post_in_thread=post_in_thread,
                    postcount=post_number,
                    message=message
                )

                # insert the parsed post into the database
                engine.execute(ins)
            except:
                pass

        engine.dispose()

print time.time() - start

After parsing I had a database with 2 tables: threads started, and posts by each user. The table structure can be seen below.

[Image: GFY webmaster forum database table structure]
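For reference, the structure corresponds roughly to the following (a sketch reconstructed from the Column definitions in the parser above; the exact MySQL column types may differ):

CREATE TABLE threads (
    threadID VARCHAR(255),
    title    VARCHAR(255),
    username VARCHAR(255),
    year     INT,
    month    INT,
    day      INT,
    hour     INT,
    minute   INT,
    AMorPM   VARCHAR(2)
);

CREATE TABLE posts (
    threadID       VARCHAR(255),
    username       VARCHAR(255),
    year           INT,
    month          INT,
    day            INT,
    hour           INT,
    minute         INT,
    AMorPM         VARCHAR(2),
    join_year      INT,
    join_month     VARCHAR(16),
    post_in_thread INT,
    postcount      INT,
    message        TEXT
);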

Queries are as follows:

To find the people with the most posts, I ran this query on the db:

SELECT DISTINCT username, postcount FROM posts ORDER BY postcount DESC;

I should mention that I put the output into Excel.


To find the people who started the most threads:

SELECT username, COUNT(*) as count FROM threads GROUP BY username ORDER BY count DESC;

[Image: Top posters on GFY]

The interesting trend line comes from counting activity in each month over the years; the query below tallies the threads started per month. This script used Python, as it involved many nested queries.

# Assumes the same database as the parser above
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')
connection = engine.connect()

for i in range(2001, 2017):
    for j in range(1, 13):
        thread_activity = "SELECT COUNT(*) as count FROM threads WHERE year = {0} and month = {1}".format(i, j)
        q = connection.execute(thread_activity)

        for r in q:
            print str(r[0])
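If you wanted to chart the monthly counts straight from Python instead of pasting them into a spreadsheet, a minimal matplotlib sketch along these lines would do it (hypothetical, not the code behind the chart below):

import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')
connection = engine.connect()

labels, counts = [], []
for year in range(2001, 2017):
    for month in range(1, 13):
        q = connection.execute(
            "SELECT COUNT(*) FROM threads WHERE year = {0} AND month = {1}".format(year, month))
        counts.append(q.scalar())
        labels.append('{0}-{1:02d}'.format(year, month))

# one x tick per year to keep the axis readable
plt.plot(range(len(counts)), counts)
plt.xticks(range(0, len(labels), 12), labels[::12], rotation=45)
plt.ylabel('Threads started per month')
plt.tight_layout()
plt.show()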

When put into a chart, this yields a cool graph:

[Image: GFY posts over the years]

Things have really tapered off; a couple of users suggested that this trend line probably tracks affiliate earnings over the years pretty closely. I made threads about these results over on gfy.com, which you can read yourself:

http://gfy.com/fucking-around-and-program-discussion/1193279-gfy-post-activity-threads-categories.html

http://gfy.com/fucking-around-and-program-discussion/1193102-top-100-prolific-posters-gfy-time.html

Any questions? I had a lot of fun writing this and looking at the data. Lots more could be done with it, but old data in this case isn't worth much; most of the programs littered throughout the board's ancient history are dead and gone. Mostly this is just a tearjerker for those who were around during the good old days. Still fascinating though.