design

Wednesday, March 21st, 2018

SOLR 7.2.1 - you are rather good :)

I’m so glad to be using SOLR in this side project.

I can recommend the tutorial, which is excellent.


Saturday, March 10th, 2018

Using Scrapy to develop a basic crawler

Take the XPath from XPathHelper, and use it in a simple crawler:

import scrapy


class BlogSpider(scrapy.Spider):
    name = 'ssvc'
    start_urls = ['https://ssvc.org.uk/phpbb/viewforum.php?f=51']

    def parse(self, response):
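        # Each topic title is an <a class="topictitle"> inside a div.list-inner.
        topic_links = response.xpath('//div[@class="list-inner"]/a[@class="topictitle"]')
        # extract() returns every matched <a> element serialised as an HTML string.
        for link in topic_links.extract():
            yield {"topictitle": link}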

Run the crawler from IntelliJ by using the following Run / Debug configuration:

Script: ~/virtualEnvs/advert/lib/python2.7/site-packages/scrapy/cmdline.py

Parameters: runspider crawler_ssvc.py

Point the script at cmdline.py, and pass the same scrapy parameters you would use on the command line.
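The same idea can be sketched as a small wrapper script, which an IDE can then run like any other Python file. This is only a sketch: `build_argv` and `run` are names I made up for illustration, and it assumes Scrapy is installed in the active virtualenv. The one Scrapy call it relies on is `scrapy.cmdline.execute`, the function behind the `scrapy` command itself.

```python
def build_argv(spider_file):
    """Build the argument list that `scrapy runspider <file>` would receive."""
    # The first element plays the role of the program name.
    return ["scrapy", "runspider", spider_file]


def run(spider_file="crawler_ssvc.py"):
    """Run the spider file in-process (hypothetical helper, not part of Scrapy)."""
    # Imported lazily so this module can be loaded without Scrapy installed.
    from scrapy.cmdline import execute

    # execute() parses the arguments the way the scrapy command would
    # and runs the requested command.
    execute(build_argv(spider_file))
```

Calling `run()` from the directory that holds crawler_ssvc.py should be equivalent to typing `scrapy runspider crawler_ssvc.py` in a terminal.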

Running gives the following output:
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121296&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">Eh!</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=120683&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1967 RHD Split Screen</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121305&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1965 splitscreen LHD for sale</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121293&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">67 SO42 Westy</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&amp;t=121245&amp;sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1964 VW SplitScreen Single Cab - OG Paint</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
[snip]

The advert titles and URLs are being parsed :)
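The items above still hold the whole `<a>` element, so the next step is splitting each one into a title and an absolute URL. Here is a minimal sketch of that step. To keep it runnable without Scrapy it uses the standard library's ElementTree (which understands this simple XPath) on a hypothetical, trimmed-down sample of the forum markup with the `sid` parameter left out. `parse_adverts` and `SAMPLE` are illustrative names, not part of the crawler.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

FORUM_URL = "https://ssvc.org.uk/phpbb/viewforum.php?f=51"

# Hypothetical sample: well-formed so the stdlib XML parser accepts it.
SAMPLE = """
<html><body>
  <div class="list-inner">
    <a href="./viewtopic.php?f=51&amp;t=121296" class="topictitle">Eh!</a>
  </div>
  <div class="list-inner">
    <a href="./viewtopic.php?f=51&amp;t=120683" class="topictitle">1967 RHD Split Screen</a>
  </div>
</body></html>
"""


def parse_adverts(markup, base_url):
    """Return one {'title', 'url'} dict per topic link found in the markup."""
    root = ET.fromstring(markup)
    adverts = []
    # Same shape as the spider's XPath, written in ElementTree's subset.
    for link in root.findall(".//div[@class='list-inner']/a[@class='topictitle']"):
        adverts.append({
            "title": link.text.strip(),
            # Resolve the relative href against the forum page URL.
            "url": urljoin(base_url, link.get("href")),
        })
    return adverts


for advert in parse_adverts(SAMPLE, FORUM_URL):
    print(advert["title"], advert["url"])
```

Inside the spider, the rough equivalent would be to loop over the selectors themselves (without `.extract()`), read `text()` and `@href` with `extract_first()`, and resolve the link with `response.urljoin(...)`.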


XPath is not something I use every day

I have been using XPathHelper (a Chrome plugin) to build my XPath queries. It provides a good way of working out the XPath for the advertisement titles and their hrefs.



February 2018

I’ve been wanting to push my latest side project forward for quite some time. It’s now time to do just that.

I’ve had the idea of developing a microservices-based product which finds “VW adverts” based on keywords and phrases. You register, enter your phrases, and off you go.

The tech stack is Spring Boot, with MongoDB. After some investigation, I’m going to use Python Scrapy for the crawling.




Scrapy

virtualenv -p python3 spider

$ scrapy startproject spider
Traceback (most recent call last):
[snip]
    TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'

So this is the first config error. As I want fully automated infrastructure, this will need to be solved. Enough for today :0/

Monday May 30th, 2017

Fixes are easy when you know how. To fix the OP_NO_TLSv1_1 error, I had to re-install OpenSSL, and then downgrade the Twisted Python module:

pip install Twisted==16.4.1

Scrapy is now working fine :O)
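Since the aim is fully automated infrastructure, a quick check like the one below could catch this kind of breakage before the crawler runs. It is a sketch only: `module_has_attr` is a made-up helper, and it assumes the `SSL` in the traceback is pyOpenSSL's `OpenSSL.SSL` module.

```python
import importlib


def module_has_attr(module_name, attr):
    """Return True/False if the module imports, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except Exception:
        # Missing or broken installation: report "unknown" rather than crash.
        return None
    return hasattr(module, attr)


# Assumption: the failing attribute lives on pyOpenSSL's OpenSSL.SSL module.
print("OP_NO_TLSv1_1 available:", module_has_attr("OpenSSL.SSL", "OP_NO_TLSv1_1"))
```

A result of False would point at the same mismatch as the traceback, and None would mean pyOpenSSL is not importable in that environment at all.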

The next task is to get an xpath so I can start getting some advert data.