The mining of a webmaster forum

There are webmaster forums for just about every kind of online industry, and offline ones as well. Many online media have experienced growth and popularity, only to ebb as newer platforms overtook them.

One such popular online industry is adult, or simply stated: porn. Adult content has gone from a relatively scarce resource to an overabundant one, where even minors readily have access to just about anything they want.

It was not always this way, and that can be inferred by analyzing one of the most popular adult webmaster forums on the web: GFY.com (gofuckyourself.com), one of the fastest-moving webmaster forums for adult program webmasters. Cumulatively there have been over 20 million posts on GFY over the course of 15 years, though things on the forum have really slowed down over time.

Adult has gone from many individual webmasters and merged into large conglomerates like MindGeek, Gamma, Kink, Bangbros, and other entities along the lines of cams and dating.

Plagued with trolls and even outright offensive behavior, GFY is nonetheless one of the best public troves for inferring activity over the years. So, out of curiosity, I did just that: I scraped the forum, parsed it, and ran some queries against the results. Below is how I did it.

The first step was scraping the text of every thread, from the inception of the board to the present day. I used a simple PHP script utilizing cURL in a for loop; simple concept, code below.


<?php
/**
 * GFY thread scraper
 * Date: 5/16/14
 * Time: 6:43 PM
 * author: andrewsyc
 */

$GFY_THREAD = "http://gfy.com/showthread.php?t=";
$thread_ID = 1;

// set to increment by 300,000 threads for each web scraper instance
$thread_max = $thread_ID + 300000;
$DIRECTORY = "/home/GFY/threads";

for ($thread_ID = 1; $thread_ID < $thread_max; $thread_ID++) {

    // Fetch the first page of the thread and write it to disk
    $fp = fopen($DIRECTORY . "/" . $thread_ID . ".txt", "w");

    $curl = curl_init();
    curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

    $result = curl_exec($curl);
    fwrite($fp, $result);
    curl_close($curl);
    fclose($fp);

    echo "\n Thread " . $thread_ID;

    /**
     * Pagination loop for threads with more than 49 posts:
     * keep fetching &page=N until the page no longer links to a "Next Page".
     */
    $contents = file_get_contents($DIRECTORY . "/" . $thread_ID . ".txt");

    if (preg_match('/\btitle="Next Page\b/i', $contents)) {

        for ($page_value = 2; $page_value < 200; $page_value++) {

            $fp = fopen($DIRECTORY . "/" . $thread_ID . "_" . $page_value . ".txt", "w");

            $curl = curl_init();
            curl_setopt($curl, CURLOPT_URL, $GFY_THREAD . $thread_ID . "&page=" . $page_value);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, 1);
            curl_setopt($curl, CURLOPT_FOLLOWLOCATION, 1);

            $result = curl_exec($curl);
            fwrite($fp, $result);
            curl_close($curl);
            fclose($fp);

            // Stop paging once this page has no "Next Page" link
            $contents = file_get_contents($DIRECTORY . "/" . $thread_ID . "_" . $page_value . ".txt");
            if (!preg_match('/\btitle="Next Page\b/i', $contents)) {
                break;
            }
        }
    }

    // brief pause between threads (usleep takes microseconds)
    usleep(500);
}
?>

After scraping all of the text, over 200 GB of it, I zipped the files and downloaded them to my local machine. Uncompressed text really does shrink well when zipped: from over 200 GB to less than 20 GB.
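For the curious, the compression pass can be as simple as gzipping each thread file before transfer. A minimal sketch; the directory path mirrors the scraper above and is an assumption, so adjust it to wherever your output actually lives:

# Minimal sketch: gzip each scraped thread file to cut transfer size.
import gzip
import os
import shutil

directory = '/home/GFY/threads'  # assumed scraper output directory

for name in os.listdir(directory):
    if not name.endswith('.txt'):
        continue
    src = os.path.join(directory, name)
    # Write name.txt.gz alongside the original, then drop the uncompressed copy
    with open(src, 'rb') as f_in:
        with gzip.open(src + '.gz', 'wb') as f_out:
            shutil.copyfileobj(f_in, f_out)
    os.remove(src)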

Next I had to create a parsing script, which I wrote in Python using BeautifulSoup4 and SQLAlchemy, with MySQL to store the parsed data for later queries.

Writing the parsing script took a lot of trial and error, and the actual running of it took over 6 days cumulatively to chew through more than 1.2 million text files. On average about 50 user posts were parsed every second; 20 million posts takes a while.

from sqlalchemy import create_engine
from sqlalchemy import MetaData
from sqlalchemy import Table, Column, Integer, String, Text
from sqlalchemy import insert
from bs4 import BeautifulSoup
import os
import re
import time

# Time the script start
start = time.time()
directory = '/path/to/textfiles'

# Create the engine and define both tables once, outside the file loop
engine = create_engine('mysql+pymysql://user:pass'
                       '@localhost/GFY_2016')
metadata = MetaData()
threads_table = Table('threads', metadata,
                      Column('threadID', String(20)),
                      Column('title', String(255)),
                      Column('username', String(255)),
                      Column('year', Integer),
                      Column('month', Integer),
                      Column('day', Integer),
                      Column('hour', Integer),
                      Column('minute', Integer),
                      Column('AMorPM', String(2)))
posts_table = Table('posts', metadata,
                    Column('threadID', String(20)),
                    Column('username', String(255)),
                    Column('year', Integer),
                    Column('month', Integer),
                    Column('day', Integer),
                    Column('hour', Integer),
                    Column('minute', Integer),
                    Column('AMorPM', String(2)),
                    Column('join_year', Integer),
                    Column('join_month', String(20)),
                    Column('post_in_thread', Integer),
                    Column('postcount', Integer),
                    Column('message', Text))
metadata.create_all(engine)

# file number tracker
i = 1

for file in os.listdir(directory):
    print i
    i = i + 1
    if not file.endswith('.txt'):
        continue

    threadID = year = month = day = hour = minute = join_month = join_year = post_in_thread = post_number = 0
    user_name = AMorPM = status = location = message = ""

    f = open(directory + '/' + file, 'r')
    threadID = file.split('.')[0]
    soup = BeautifulSoup(f.read(), 'lxml')
    f.close()

    thread_title = ""
    # Each post on a vBulletin page sits in its own <table id="post...">
    for p in soup.find_all('table', attrs={'id': re.compile('post')}):

        items = BeautifulSoup(str(p), 'lxml')

        # Post date/time lives in the <td class="thead"> header,
        # formatted like "05-16-2014, 06:43 PM"
        date = items.find('td', attrs={'class': 'thead'})
        date_string = BeautifulSoup(str(date), 'lxml').get_text().strip()
        parsed_date = date_string.split('-')
        try:
            # Gets the month, day, year and time from the extracted text
            month = parsed_date[0]
            day = parsed_date[1]
            parsed_date = parsed_date[2].split(',')
            year = parsed_date[0]
            post_time = parsed_date[1].split(':')
            hour = post_time[0]
            minute = post_time[1].split(' ')[0]
            AMorPM = post_time[1].split(' ')[1]
        except:
            pass

        try:
            # Post number within the thread
            post_anchor = items.find('a', attrs={'target': 'new'})
            post_in_thread = BeautifulSoup(str(post_anchor), 'lxml').get_text()

            # Get the username of the individual
            user_name = items.find('a', attrs={'class': 'bigusername'})
            user_name = BeautifulSoup(str(user_name), 'lxml').get_text()
        except:
            pass

        try:
            # Get the status of the user, e.g. confirmed or so fucking banned
            status = items.find('div', attrs={'class': 'smallfont'})
            status = BeautifulSoup(str(status), 'lxml').get_text()

            # Join date, e.g. "Join Date: May 2003"
            join_date = items.find(string=re.compile("Join Date:"))
            join_date = BeautifulSoup(str(join_date), 'lxml').get_text()
            join_month = join_date.split(' ')[2]
            join_year = join_date.split(' ')[3]
        except:
            pass

        # Location
        try:
            location = items.find(string=re.compile("Location:"))
            location = BeautifulSoup(str(location), 'lxml').get_text()
        except:
            pass

        # The user's total post count, e.g. "Posts: 12,345"
        try:
            count_text = items.find(string=re.compile("Posts:"))
            count_text = BeautifulSoup(str(count_text), 'lxml').get_text().strip()
            post_number = count_text.split(' ')[1].replace(',', '')
        except:
            pass

        # The message body of the post itself
        try:
            message = BeautifulSoup(str(items.find_all(id=re.compile('post_message'))), 'lxml').get_text()
            message = message.replace('\\n', '').replace(']', '').replace('[', '').replace('\\r', '')
        except:
            pass

        # This creates a new thread entry if the post is determined to be the first one
        if post_in_thread == '1':
            try:
                # The thread title sits in the first post's smallfont div;
                # pull the title text out from between the tags
                title_block = items.find('td', attrs={'class': 'alt1'})
                title_div = BeautifulSoup(str(title_block), 'lxml').find('div', attrs={'class': 'smallfont'})
                thread_title = re.search('(?<=>)(.*?)(?=<)', str(title_div))

                # Make sure to add items here that were parsed
                ins = insert(threads_table).values(
                    threadID=threadID,
                    title=thread_title.group(0),
                    username=user_name,
                    year=year,
                    month=month,
                    day=day,
                    hour=hour,
                    minute=minute,
                    AMorPM=AMorPM
                )
                # insert the parsed thread into the database
                engine.execute(ins)
            except:
                pass

        try:
            # Make sure to add items here that were parsed
            ins = insert(posts_table).values(
                threadID=threadID,
                username=user_name,
                year=year,
                month=month,
                day=day,
                hour=hour,
                minute=minute,
                AMorPM=AMorPM,
                join_year=join_year,
                join_month=join_month,
                post_in_thread=post_in_thread,
                postcount=post_number,
                message=message
            )
            # insert the parsed post into the database
            engine.execute(ins)
        except:
            pass

engine.dispose()
print time.time() - start

After parsing I had a database with two tables: threads started, and posts by each user. The table structure can be seen below.

[Image: GFY webmaster forum database table structure]

Queries are as follows:

To find the people with the most posts, I ran this query on the database:

SELECT DISTINCT username, postcount FROM posts ORDER BY postcount DESC;

I should mention that I put the output into Excel.
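Getting query output into Excel is easiest via a CSV. A minimal sketch, assuming the same placeholder credentials as the parsing script; top_posters.csv is just an illustrative filename:

import csv
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')
connection = engine.connect()

rows = connection.execute(
    "SELECT DISTINCT username, postcount FROM posts ORDER BY postcount DESC")

# Dump the result set to a CSV that Excel opens directly
with open('top_posters.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerow(['username', 'postcount'])
    for row in rows:
        writer.writerow([row[0], row[1]])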

[Image: query output of the top posters]

To find the people starting the most threads:

SELECT username, COUNT(*) as count FROM threads GROUP BY username ORDER BY count DESC;

[Image: Top posters on GFY]

The interesting trend line comes from counting activity in each month over the years. I used Python for this, since it involves running one query per month:

from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')
connection = engine.connect()

for i in range(2001, 2017):
    for j in range(1, 13):
        thread_activity = "SELECT COUNT(*) AS count FROM threads WHERE year = {0} AND month = {1}".format(i, j)
        q = connection.execute(thread_activity)

        for r in q:
            print str(r[0])

When put into a chart, this yields quite a telling graph:

[Chart: GFY posts over the years]
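The chart itself can be reproduced with matplotlib. A minimal sketch under the same placeholder credentials, collecting one monthly count per iteration of the loop above:

import matplotlib.pyplot as plt
from sqlalchemy import create_engine

engine = create_engine('mysql+pymysql://user:pass@localhost/GFY_2016')
connection = engine.connect()

# Collect one count per month, January 2001 through December 2016
counts = []
for i in range(2001, 2017):
    for j in range(1, 13):
        q = connection.execute(
            "SELECT COUNT(*) AS count FROM threads WHERE year = {0} AND month = {1}".format(i, j))
        for r in q:
            counts.append(r[0])

plt.plot(counts)
plt.xlabel('Months since January 2001')
plt.ylabel('Threads started per month')
plt.title('GFY activity over the years')
plt.show()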

Things have really tapered off; a couple of users suggested that this trend line is probably very similar to affiliate earnings over the years. I made threads about these findings over on gfy.com, which you can read yourself:

http://gfy.com/fucking-around-and-program-discussion/1193279-gfy-post-activity-threads-categories.html

http://gfy.com/fucking-around-and-program-discussion/1193102-top-100-prolific-posters-gfy-time.html

Any questions? I had a lot of fun writing this and looking at the data. Lots more could be done with it, but old data in this case isn't worth much: most of the programs littered throughout this board's ancient history are dead and gone. Mostly this is just a tear-jerker for those who were around during the good old days. Still fascinating, though.