A couple of things bugged me about trying to import data from my DFP impression logs into a simple MySQL database. Mainly it came down to Google using a very obscure character to delimit the text. In our case (and I believe you can ask them to change this) they are using the 'thorn' character – þ.
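If you want to double-check which byte is actually being used as the delimiter in your own files, a quick peek at the raw bytes of the first line will tell you. The filename here is just an example, and this assumes the file has already been gunzipped:

head -n 1 NetworkImpressions_123456_20131114_08 | hexdump -C | head
# the delimiter the script below assumes shows up as fe (Latin-1 thorn); if your logs are UTF-8 it appears as c3 be instead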
So, my initial thought was to do a simple stream search/replace using sed and then import the file locally into MySQL. The only thing I was worried about was the size of the logs I was working with: a single day's NetworkImpressions data was averaging 25GB, which works out at over 1GB per hour of logs!
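As an aside, if you want to see how big a log will be before committing to unzipping it, gzip can report the uncompressed size without extracting anything (again, the filename is just an example; gzip only stores this size modulo 4GB, so it's only reliable for files that are smaller than that uncompressed):

gzip -l NetworkImpressions_123456_20131114_08.gz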
So, here's a simple script that I use to unzip the file, convert the þ to a \t (tab), and then import the data into my database. Your log files will be named something like NetworkImpressions_123456_20131114_08.gz.
The script just requires the name of the file (without the .gz file extension). I called my script 'importdfp', so to import a file called NetworkImpressions_123456_20131114_08.gz I would invoke the script like so:
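Assuming the script is executable and you run it from the directory the logs live in (the same path that's hard-coded into the LOAD DATA statement below), that looks like:

./importdfp NetworkImpressions_123456_20131114_08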
This is rough and ready but gets the job done. I'm currently working on adding timing and logging to give a bit more feedback, as the DB table re-indexes the new rows as they're brought in, which can be a little time consuming (there's a rough sketch of the sort of timing I mean after the script below).
#!/bin/bash
# unzip the file
gunzip "$1.gz"
# convert the thorn (0xFE) to a tab in a temp file, then replace the original file with the converted one
sed 's/\xFE/\t/g' "$1" > "${1}_TMP"
mv "${1}_TMP" "$1"
# import into MySQL
mysql -u root -pgnajim88 -e 'LOAD DATA INFILE "/home/location/of/your/dfp/log/files/'"$1"'" INTO TABLE `LOG_IMPS` FIELDS TERMINATED BY "\t" ESCAPED BY "\\" LINES TERMINATED BY "\n" IGNORE 1 LINES;' DFP
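For the timing side of things, nothing clever is needed; bash's built-in SECONDS counter plus a few date stamps written out to a log file give enough feedback to see where the minutes go. This is only a sketch of the idea rather than the finished script, and importdfp.log is just a made-up name for the log file:

#!/bin/bash
# sketch: record how long each stage of the import takes
log() { echo "$(date '+%F %T') $1 (${SECONDS}s elapsed)" >> importdfp.log; }

log "starting import of $1"
gunzip "$1.gz"
log "unzipped"
sed 's/\xFE/\t/g' "$1" > "${1}_TMP"
mv "${1}_TMP" "$1"
log "thorn converted"
# ... the mysql LOAD DATA step from above goes here ...
log "loaded into MySQL"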
Yes, I know I should be using something like Hadoop, but this is a rough-and-ready test to prove we could bring the data in simply, query it and get some basic data out. I was doing this on a simple dual VPS with 2GB of memory, and it was taking approx 5-7 minutes to import each log file (including unzipping, the sed transformation, loading the file into the DB and indexing the records).