Wednesday, March 21st, 2018

SOLR 7.2.1 - you are rather good :)

I’m so glad to be using SOLR in this side project.

I can recommend the tutorial, which is excellent.

Saturday, March 10th, 2018

Using Scrapy to develop a basic crawler

Take the XPath from XPathHelper and use it in a simple crawler:

import scrapy

class BlogSpider(scrapy.Spider):
    name = 'ssvc'
    start_urls = ['']

    def parse(self, response):
        # Each match is the full HTML of an <a class="topictitle"> element.
        links = response.xpath('//div[@class="list-inner"]/a[@class="topictitle"]')
        for link in links.extract():
            yield {"topictitle": link}

Run the crawler from IntelliJ using the following Run/Debug configuration:

Script ~/virtualEnvs/advert/lib/python2.7/site-packages/scrapy/

Parameters runspider

Use the script above, with the same Scrapy parameters you would use on the command line.
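The same run can also be scripted outside the IDE. This is only a rough sketch: it assumes the installed Scrapy can be started with `python -m scrapy`, and the helper names `build_runspider_command` and `run_spider`, along with the file names in the usage note, are made up for illustration.

```python
import subprocess
import sys


def build_runspider_command(spider_file, output_file=None):
    """Build the argument list for `python -m scrapy runspider <file>`."""
    # sys.executable keeps the run inside the current virtualenv.
    command = [sys.executable, "-m", "scrapy", "runspider", spider_file]
    if output_file is not None:
        # -o writes the scraped items to a feed file.
        command += ["-o", output_file]
    return command


def run_spider(spider_file, output_file=None):
    """Run the spider in a child process and return its exit code."""
    return subprocess.call(build_runspider_command(spider_file, output_file))
```

Calling `run_spider("blog_spider.py", "adverts.json")` would then behave like `scrapy runspider blog_spider.py -o adverts.json` on the command line, where both file names are placeholders.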

Running gives the following output:
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121296&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">Eh!</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=120683&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1967 RHD Split Screen</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121305&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1965 splitscreen LHD for sale</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121293&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">67 SO42 Westy</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121245&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1964 VW SplitScreen Single Cab - OG Paint</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200>

The advert titles and URLs are being parsed :)

XPath is not something I use every day

I have been using XPathHelper (a Chrome plugin) to test my XPath queries. It provides a good way of working out how to pick up the advertisement titles and their href attributes.
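To try the query away from the browser, here is a minimal sketch that splits each match into a title and an href. It uses the standard library's ElementTree rather than Scrapy's selectors, which only supports a subset of XPath and needs well-formed markup, so it is for experimenting and not a replacement for the spider. The sample markup is modelled on the scraped output (with the sid parameter dropped), the wrapping div is assumed from the query, and `parse_adverts` is a made-up name.

```python
import xml.etree.ElementTree as ET

# Sample markup modelled on the scraped output; the wrapping
# <div class="list-inner"> is assumed from the XPath query.
SAMPLE = """
<html><body>
<div class="list-inner">
  <a href="./viewtopic.php?f=51&amp;t=121296" class="topictitle">Eh!</a>
</div>
<div class="list-inner">
  <a href="./viewtopic.php?f=51&amp;t=120683" class="topictitle">1967 RHD Split Screen</a>
</div>
</body></html>
"""


def parse_adverts(markup):
    """Return a list of {"title", "href"} dicts, one per advert link."""
    root = ET.fromstring(markup.strip())
    # ElementTree paths are relative, so the query starts with ".//".
    links = root.findall(".//div[@class='list-inner']/a[@class='topictitle']")
    return [{"title": a.text, "href": a.get("href")} for a in links]


if __name__ == "__main__":
    for advert in parse_adverts(SAMPLE):
        print(advert)
```

Keeping the title and href as separate fields should make the items easier to match against keywords later than the raw `<a>` markup.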


February 2018

I’ve been wanting to push my latest side project forward for quite some time. It’s now time to do just that.

I’ve had the idea of developing a microservices-based product which finds “VW adverts” based on keywords and phrases. You register, enter your phrases, and off you go.

The tech stack is Spring Boot, with MongoDB. After some investigation, I’m going to use Python Scrapy for the crawling.



virtualenv -p python3 spider

$ scrapy startproject spider
Traceback (most recent call last):
    TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'

So this is the first config error. As I want fully automated infrastructure, this will need to be solved. Enough for today :0/

Monday May 30th, 2017

Fixes are easy when you know how. To fix the OP_NO_TLSv1_1 error, I had to re-install OpenSSL, and then downgrade the Twisted Python module:

pip install Twisted==16.4.1

Scrapy is now working fine :O)
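For the automated setup, a small check like this could flag the problem before Scrapy is started. It is a sketch under one assumption: that the `SSL` in the traceback is pyOpenSSL's `OpenSSL.SSL` module. The helper name `missing_attributes` is made up for illustration.

```python
import importlib


def missing_attributes(module_name, required):
    """Return the names in `required` that the named module lacks.

    Returns None if the module itself cannot be imported.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name in required if not hasattr(module, name)]


if __name__ == "__main__":
    # Assumption: the SSL module in the traceback is pyOpenSSL's OpenSSL.SSL.
    missing = missing_attributes("OpenSSL.SSL", ["OP_NO_TLSv1_1"])
    if missing is None:
        print("pyOpenSSL is not installed")
    elif missing:
        print("pyOpenSSL lacks: " + ", ".join(missing))
    else:
        print("OP_NO_TLSv1_1 is available")
```

Running it inside the virtualenv straight after the install step would show whether the environment is likely to hit the same AttributeError.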

The next task is to get an xpath so I can start getting some advert data.