"Side Projects" aka VW-adverts
Wednesday, March 21st, 2018
SOLR 7.2.1 - you are rather good :)
I’m so glad to be using SOLR in this side project.
I can recommend the tutorial, which is excellent.
Saturday, March 10th, 2018
Using Scrapy to develop a basic crawler
Take the XPath from XPathHelper and use it in a simple crawler:
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'ssvc'
    start_urls = ['https://ssvc.org.uk/phpbb/viewforum.php?f=51']

    def parse(self, response):
        for h3 in response.xpath('//div[@class="list-inner"]/a[@class="topictitle"]').extract():
            yield {"topictitle": h3}
Run the crawler from IntelliJ using the following Run/Debug configuration:
Script ~/virtualEnvs/advert/lib/python2.7/site-packages/scrapy/cmdline.py
Parameters runspider crawler_ssvc.py
Use cmdline.py as the script, with the same Scrapy parameters you would use on the command line.
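The same run can also be started outside the IDE. Here is a minimal sketch, assuming Scrapy is installed for the interpreter that runs the script. It relies on "python -m scrapy", which starts the same command-line entry point as cmdline.py. The helper names build_runspider_command and run_spider are made up for illustration.

```python
import subprocess
import sys


def build_runspider_command(spider_file):
    """Build the argument list for running one spider file with Scrapy."""
    # "python -m scrapy" starts the same entry point as scrapy/cmdline.py,
    # so the remaining arguments match the Parameters field above.
    return [sys.executable, "-m", "scrapy", "runspider", spider_file]


def run_spider(spider_file):
    """Run the spider in a child process and return Scrapy's exit code."""
    return subprocess.call(build_runspider_command(spider_file))
```

Calling run_spider("crawler_ssvc.py") from the directory that holds the spider file should behave like the Run/Debug configuration, provided the virtualenv's interpreter is the one in use.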
Running the crawler gives the following output:
{'topictitle': u'<a href="./viewtopic.php?f=51&t=121296&sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">Eh!</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&t=120683&sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1967 RHD Split Screen</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&t=121305&sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1965 splitscreen LHD for sale</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&t=121293&sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">67 SO42 Westy</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
{'topictitle': u'<a href="./viewtopic.php?f=51&t=121245&sid=5f3074b32ef61397a2a554f316f943c9" class="topictitle">1964 VW SplitScreen Single Cab - OG Paint</a>'}
2018-03-10 13:23:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ssvc.org.uk/phpbb/viewforum.php?f=51>
[snip]
The advert titles and URLs are being parsed :)
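Each scraped item is still a whole <a> element, and the href carries a per-session sid parameter. Here is a rough sketch of how one of those fragments could be split into a title and a stable absolute URL, using only the standard library. AnchorParser and clean_advert are hypothetical names, and the base URL is the one from start_urls.

```python
from html.parser import HTMLParser
from urllib.parse import parse_qsl, urlencode, urljoin, urlsplit, urlunsplit

# The forum page the crawler starts from (taken from start_urls).
BASE_URL = "https://ssvc.org.uk/phpbb/viewforum.php?f=51"


class AnchorParser(HTMLParser):
    """Collect the href and text of the first <a> element in a fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text = ""
        self._in_anchor = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")
            self._in_anchor = True

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_anchor = False

    def handle_data(self, data):
        if self._in_anchor:
            self.text += data


def clean_advert(fragment, base_url=BASE_URL):
    """Hypothetical helper: turn one scraped <a> fragment into title + URL."""
    parser = AnchorParser()
    parser.feed(fragment)
    parser.close()
    if parser.href is None:
        raise ValueError("fragment contains no <a href=...> element")
    # Resolve "./viewtopic.php?..." against the forum page URL.
    parts = urlsplit(urljoin(base_url, parser.href))
    # Drop the session id so the same advert always maps to the same URL.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "sid"]
    url = urlunsplit(parts._replace(query=urlencode(query)))
    return {"title": parser.text.strip(), "url": url}
```

Inside the spider itself, the same split could probably be done earlier by selecting text() and @href with XPath rather than extracting the whole element.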
XPath is not something I use every day
I have been using XPathHelper (a Chrome plugin) to build my XPath queries. It provides a good way of testing queries that pick out the advertisement titles and their href links.
February 2018
I’ve been wanting to push my latest side project forward for quite some time. It’s now time to do just that.
I’ve had the idea of developing a microservices-based product which finds “VW adverts” based on keywords and phrases. You register, enter your phrases, and off you go.
The tech stack is Spring Boot, with MongoDB. After some investigation, I’m going to use Python Scrapy for the crawling.
Scrapy
virtualenv -p python3 spider
$ scrapy startproject spider
Traceback (most recent call last):
[snip]
TLSVersion.TLSv1_1: SSL.OP_NO_TLSv1_1,
AttributeError: 'module' object has no attribute 'OP_NO_TLSv1_1'
So this is the first config error. As I want fully automated infrastructure, this will need to be solved. Enough for today :0/
Monday May 30th, 2017
Fixes are easy when you know how. To fix the OP_NO_TLSv1_1 error, I had to re-install OpenSSL, and then downgrade the Twisted Python module:
pip install Twisted==16.4.1
Scrapy is now working fine :O)
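For automated infrastructure, a small pre-flight check could catch this kind of error before Scrapy is ever started. Here is a minimal sketch, assuming the SSL in the traceback is pyOpenSSL's OpenSSL.SSL module. The function names missing_attributes and check_tls_constant are made up for illustration.

```python
import importlib


def missing_attributes(module_name, names):
    """Return the names the module lacks, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return [name for name in names if not hasattr(module, name)]


def check_tls_constant():
    """Hypothetical pre-flight check for the AttributeError seen earlier."""
    # Assumption: OpenSSL.SSL is the pyOpenSSL module behind "SSL" in the
    # traceback; OP_NO_TLSv1_1 is the attribute it reported as missing.
    return missing_attributes("OpenSSL.SSL", ["OP_NO_TLSv1_1"])
```

An empty list from check_tls_constant() would mean the attribute is present, a non-empty list would mean the installed OpenSSL bindings are too old for the installed Twisted, and None would mean pyOpenSSL is not installed at all.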
The next task is to get an xpath so I can start getting some advert data.