In the previous post I made a quick fix with Redis that makes re-loading links a bit faster.
While searching for plugins I could use on my blog, I found this plugin source code, which uses the content_object_init
signal. Let's check whether using it makes our plugin even faster.
To The Point
To get precise and reliable performance measurements, let's first change our debugging project and add more articles with links to it.
The source code of the debugging plugin used for the performance checks can be found at this branch.
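For illustration, a throwaway script along these lines can produce a batch of test articles; the content directory, article count, and URLs here are made up for this example:

# generate_articles.py -- hypothetical helper that fills the debugging
# project with articles containing <ahref> links for the plugin to resolve.
import os

CONTENT_DIR = "content"  # assumed Pelican content directory
URLS = [
    "https://www.python.org/",
    "https://getpelican.com/",
]

if not os.path.isdir(CONTENT_DIR):
    os.makedirs(CONTENT_DIR)
for i in range(50):
    path = os.path.join(CONTENT_DIR, "article_%d.md" % i)
    with open(path, "w") as f:
        f.write("Title: Article %d\nDate: 2017-01-01 10:00\n\n" % i)
        for url in URLS:
            # Markdown passes raw inline HTML through, so the custom
            # <ahref> tag survives into the generated article HTML.
            f.write("<ahref>%s</ahref>\n\n" % url)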
Comparison
- Without plugin: took 0.14 second.
- With the previous plugin (Redis-aware): took 0.17-0.20 second.
- With the content_object_init change: the first run took ~10 seconds, since every title has to be fetched over the network to warm the Redis cache; the second (and subsequent) runs took ~0.20 second.
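For reference, timings like the ones above can be reproduced by timing the build itself; a minimal sketch (assuming a standard Pelican project layout) might look like this:

# time_build.py -- hypothetical timing wrapper around the Pelican build.
import subprocess
import time

start = time.time()
subprocess.check_call(["pelican", "content", "-s", "pelicanconf.py"])
print("Took %.2f second." % (time.time() - start))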
Source code of the changed plugin
# -*- coding: utf-8 -*-
""" This is the main script for pelican_link_to_title (Python 2-era code). """
from pelican import signals
from bs4 import BeautifulSoup
import urllib


def link_to_title_plugin(generator):
    """Link_to_Title plugin: rewrites <ahref> tags once all articles are generated."""
    article_ahreftag = {}
    for article in generator.articles:
        soup = BeautifulSoup(article._content, 'html.parser')
        ahref_tags = soup.find_all('ahref')
        if ahref_tags:
            article_ahreftag[article] = (ahref_tags, soup)
    for article, (ahref_tags, soup) in article_ahreftag.items():
        for tag in ahref_tags:
            url_page = tag.string
            if url_page and ("http://" in url_page or "https://" in url_page):
                # Turn <ahref>URL</ahref> into <a href="URL">page title</a>.
                tag.name = "a"
                tag.string = read_page(url_page)
                tag.attrs = {"href": url_page}
        article._content = str(soup).decode("utf-8")


def read_page(url_page):
    """Returns the <title> of url_page, cached in Redis."""
    import redis
    # NOTE: opening a connection per call is wasteful; a module-level
    # connection would be enough.
    redconn = redis.Redis(host='localhost', port=6379, db=0)
    found = redconn.get(url_page)
    if found:
        return found
    # Cache miss: fetch the page and store its title for the next run.
    html = urllib.urlopen(url_page).read()
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("title").string
    redconn.set(url_page, title)
    return title


def content_object_init(instance):
    """Rewrites <ahref> tags as soon as a single content object is read."""
    if instance._content is None:
        return
    soup = BeautifulSoup(instance._content, "html5lib")
    for tag in soup.find_all('ahref'):
        url_page = tag.contents[0] if tag.contents else None
        if url_page and ("http://" in url_page or "https://" in url_page):
            tag.name = "a"
            try:
                tag.string = read_page(url_page)
            except Exception:
                # Leave the raw URL as the link text if the page
                # cannot be fetched.
                pass
            tag.attrs = {"href": url_page}
    # html5lib wraps the fragment in <html><head></head><body>...</body></html>;
    # keep only the original body content.
    instance._content = soup.body.decode_contents()


def register():
    """ Registers Plugin """
    signals.content_object_init.connect(content_object_init)
    # signals.article_generator_finalized.connect(link_to_title_plugin)
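For completeness, enabling it works like any other Pelican plugin; a minimal sketch, assuming the plugin lives under a local plugins/ directory:

# pelicanconf.py (excerpt) -- example settings; adjust paths to your setup.
PLUGIN_PATHS = ['plugins']
PLUGINS = ['pelican_link_to_title']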
Effects
Well, as I see it, the effects are not that different.
I will need to check whether replacing urllib with something that supports parallel fetching makes it better.
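A rough sketch of that idea, fetching titles concurrently with a thread pool (this assumes Python 3's urllib.request, and fetch_title/fetch_titles are names made up for this example):

# parallel_titles.py -- a hypothetical sketch, not part of the plugin yet.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_title(url):
    """Downloads url and returns the text of its <title> tag."""
    html = urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("title").string


def fetch_titles(urls, max_workers=8):
    """Fetches all titles concurrently so the network waits overlap.

    Errors propagate on iteration; real code would want per-URL handling.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch_title, urls)))

The plugin would then collect every URL first, call fetch_titles once, and rewrite the tags from the returned mapping, so a cold-cache build waits roughly as long as the slowest page instead of the sum of all pages.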
Acknowledgements
Related links
- Python Better network API than urllib - Stack Overflow
- Something faster than urllib2.open()? : Python
- time - How can I speed up fetching pages with urllib2 in python? - Stack Overflow
- python - What are the differences between the urllib, urllib2, and requests module? - Stack Overflow
Thanks!
That's it :) Comment, share or don't - up to you.
Any suggestions on what I should blog about? Post a comment in the box below or poke me on Twitter: @anselmos88.
See you in the next episode! Cheers!