Commit 52caab5

updating docs and copyright
1 parent 620e10a commit 52caab5

4 files changed

Lines changed: 191 additions & 139 deletions

LICENSE.txt

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
1-
Copyright (c) 2013, National Write Your Congressman
1+
Copyright (c) 2014, Constituent Voice, LLC.
22
All rights reserved.
33

44
Redistribution and use in source and binary forms, with or without modification,

README.rst

Lines changed: 183 additions & 0 deletions
@@ -0,0 +1,183 @@
1+
=============
2+
ImageResolver
3+
=============
4+
5+
A Python clone of ImageResolver for finding significant images in HTML content.
6+
See the excellent JS version at: https://github.com/mauricesvay/ImageResolver
7+
8+
USAGE
9+
-----
10+
11+
::
12+
13+
    import imageresolver
14+
    import sys
15+
16+
    try:
17+
        i = imageresolver.ImageResolver()
18+
        i.register(imageresolver.FileExtensionResolver())
19+
        i.register(imageresolver.ImgurPageResolver())
20+
        i.register(imageresolver.WebpageResolver(load_images=True, parser='lxml', blacklist='easylist.txt'))
21+
        url = sys.argv[1]
22+
23+
        print(i.resolve(url))
24+
    except Exception:
25+
        print("An error occurred")
26+
27+
Differences From the Javascript Version
28+
---------------------------------------
29+
30+
* Methods return values instead of calling callbacks
31+
32+
* WebpageResolver has lots of new options (see below)
33+
34+
* Added some debugging features
35+
36+
* Exceptions are raised instead of being passed to an error callback
37+
38+
WebpageResolver Additions
39+
-------------------------
40+
41+
* rules syntax is now based on AdBlockPlus filters (https://adblockplus.org/en/filters)
42+
43+
* New rules can be added without writing a resolver
44+
45+
* Image sources can be blacklisted or whitelisted
46+
47+
* Loads as little of the image as possible when fetching image info. Stops downloading once the dimensions are found or a settable limit is reached.
48+
49+
* The original rules from the JS version are still implemented. (see options)
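To give a feel for the AdBlockPlus-style rule syntax mentioned in the list above, here is a minimal sketch that turns a few common filter forms into regular expressions. It is not the abpy-derived code used by this package; the function name ``abp_to_regex`` is hypothetical, and only ``||`` (domain anchor), ``|`` (start/end anchor), ``*`` (wildcard) and ``^`` (separator) are handled. Options, exception rules and element hiding are left out.

```python
import re

def abp_to_regex(rule):
    """Translate a simplified AdBlockPlus-style filter into a compiled regex.

    Hypothetical helper for illustration only; covers ||, |, * and ^.
    """
    pattern = ""
    if rule.startswith("||"):
        # "||" anchors the match at the start of a (sub)domain name.
        pattern = r"^[a-z][a-z0-9+.\-]*://(?:[^/?#]*\.)?"
        rule = rule[2:]
    elif rule.startswith("|"):
        # A single leading "|" anchors at the very start of the address.
        pattern = "^"
        rule = rule[1:]

    end_anchor = rule.endswith("|")
    if end_anchor:
        rule = rule[:-1]

    for ch in rule:
        if ch == "*":
            pattern += ".*"
        elif ch == "^":
            # Separator: anything except a letter, digit or one of _ - . %,
            # or the end of the address.
            pattern += r"(?:[^\w\-.%]|$)"
        else:
            pattern += re.escape(ch)

    if end_anchor:
        pattern += "$"
    return re.compile(pattern, re.IGNORECASE)

blocked = abp_to_regex("||ads.example.com^")
print(bool(blocked.search("http://ads.example.com/banner.gif")))
print(bool(blocked.search("http://notads.example.com/photo.jpg")))
```

A rule compiled this way could be checked against each image source; whether a hit lowers or raises the score would depend on which list the rule came from.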
50+
51+
ImageResolver() METHODS
52+
-----------------------
53+
54+
**__init__** *(\*\*kwargs)*
55+
56+
Keyword options
57+
58+
* *max_read_size* - set to the maximum number of bytes to read to find the width and height of an image. Default `10240`
59+
* *chunk_size* - set to the number of bytes to read per chunk. Default `1024`
60+
* *read_all* - set to read the entire image and then detect its info. Option will override max_read_size. Default `False`
61+
* *debug* - set to enable debugging output (logger="ImageResolver"). Default `False`
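The interplay of *chunk_size* and *max_read_size* can be sketched as follows. This is an illustration under assumptions, not the package's own detection code: it only understands PNG headers, and the helper names ``png_size`` and ``read_image_size`` are hypothetical. It reads one chunk at a time and stops as soon as the dimensions are known or the read limit is hit.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def png_size(data):
    """Return (width, height) if data holds a complete PNG IHDR header, else None."""
    # 8-byte signature + 4-byte chunk length + 4-byte chunk type + 8 bytes of size
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE) or data[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", data[16:24])

def read_image_size(stream, chunk_size=1024, max_read_size=10240):
    """Read chunk_size bytes at a time until the size is known or the limit is hit."""
    buf = b""
    while len(buf) < max_read_size:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of stream
        buf += chunk
        size = png_size(buf)
        if size is not None:
            return size, len(buf)
    return None, len(buf)

# A fake PNG: a valid header for a 640x480 image followed by padding bytes.
fake_png = (PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR"
            + struct.pack(">II", 640, 480) + b"\x00" * 50000)
size, bytes_read = read_image_size(io.BytesIO(fake_png))
print(size, bytes_read)
```

With the defaults, a stream that never yields a recognizable header is abandoned after 10240 bytes rather than downloaded in full.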
62+
63+
**fetch** *(string url)*
64+
65+
Fetches a URL and returns the response data.
66+
67+
**fetch_image_info** *(string url)*
68+
69+
Fetches an image url and examines the resulting image. Returns a tuple consisting of the detected file extension, the width and the height of the image.
70+
71+
**register** *(instance filter)*
72+
73+
Register a filter to examine an image with. The filter argument must be an instance of a class that has a `resolve()` method. `resolve()` must accept a string URL and must return a url or `None`
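As a minimal sketch of a custom filter that satisfies this contract, the class below maps a page URL on a made-up photo site to a direct image URL. The host ``photos.example.com``, its URL layout, and the class name are all hypothetical; only the ``resolve()`` shape (string in, URL or `None` out) comes from the description above.

```python
import re

class ExamplePhotoPageResolver(object):
    """Hypothetical filter for an imaginary site, shown only to illustrate resolve()."""

    # Assumed page layout: http://photos.example.com/view/<id>
    PAGE = re.compile(r"^https?://photos\.example\.com/view/(\w+)$")

    def resolve(self, url, **kwargs):
        match = self.PAGE.match(url)
        if match is None:
            return None  # not our kind of URL; let the next filter try
        return "http://photos.example.com/raw/%s.jpg" % match.group(1)

resolver = ExamplePhotoPageResolver()
print(resolver.resolve("http://photos.example.com/view/abc123"))
print(resolver.resolve("http://example.org/"))
```

An instance of such a class would be passed to ``register()`` in the same way as the built-in resolvers in the USAGE section.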
74+
75+
**resolve** *(string url)*
76+
77+
Loop through each registered filter until a url is resolved by one of them. If no url is found, returns `None`
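The first-match-wins behaviour described here can be sketched with a small stand-in class. ``FilterChain``, ``NeverResolves`` and ``EchoJpeg`` are hypothetical names used for illustration and are not part of the package.

```python
class FilterChain(object):
    """Stand-in for the register()/resolve() pair; not the library's code."""

    def __init__(self):
        self.filters = []

    def register(self, image_filter):
        self.filters.append(image_filter)

    def resolve(self, url):
        # Ask each filter in registration order; the first truthy answer wins.
        for image_filter in self.filters:
            resp = image_filter.resolve(url)
            if resp:
                return resp
        return None

class NeverResolves(object):
    def resolve(self, url, **kwargs):
        return None

class EchoJpeg(object):
    def resolve(self, url, **kwargs):
        return url if url.endswith(".jpg") else None

chain = FilterChain()
chain.register(NeverResolves())
chain.register(EchoJpeg())
print(chain.resolve("http://example.com/a.jpg"))
print(chain.resolve("http://example.com/page.html"))
```

Because order matters, cheap filters (such as an extension check) are best registered before filters that need to fetch the page.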
78+
79+
80+
FileExtensionResolver() METHODS
81+
-------------------------------
82+
83+
**resolve** *(string url)*
84+
85+
Returns the url if its file extension matches a known image type.
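A rough sketch of such an extension check is shown below. The set of extensions and the function name ``resolve_by_extension`` are assumptions made for illustration; the package's own list may differ.

```python
import os

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2

# Assumed list of image extensions, for illustration only.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}

def resolve_by_extension(url):
    """Return url if its path ends in a known image extension, else None."""
    # Parse first so the query string and fragment do not confuse the check.
    path = urlparse(url).path
    ext = os.path.splitext(path)[1].lower()
    return url if ext in IMAGE_EXTENSIONS else None

print(resolve_by_extension("http://example.com/pic.JPG?size=large"))
print(resolve_by_extension("http://example.com/page.html"))
```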
86+
87+
ImgurPageResolver() METHODS
88+
---------------------------
89+
90+
**resolve** *(string url)*
91+
92+
Returns an Imgur image url if `url` matches the pattern of an Imgur page
93+
94+
WebpageResolver() METHODS
95+
-------------------------
96+
97+
The workhorse of this module. Our uses revolve mostly around this filter and thus it is the
98+
most feature complete and tested.
99+
100+
**__init__** *(\*\*kwargs)*
101+
102+
Initialize the class with options:
103+
104+
* *load_images* - set to true to load the first 1k of images whose size is not set in HTML. Default `False`
105+
* *use_js_ruleset* - set to true to use the original rules from the Javascript version. Default `False`
106+
* *use_adblock_filters* - set to false to disable adblock filters. Default `True`
107+
* *parser* - set to a BeautifulSoup compatible parser (lxml is recommended). Default `html.parser`
108+
* *blacklist* - set to a file containing AdBlockPlus style filters used to lower an image's score. Default `blacklist.txt`
109+
* *whitelist* - set to a file containing AdBlockPlus style filters used to raise an image's score. Default `whitelist.txt`
110+
* *significant_surface* - Amount of surface (width x height) of the image required to add additional scoring
111+
* *boost_jpeg* - add (int) boost score to JPEG files. Default `1`
112+
* *boost_gif* - add (int) boost score to GIF files. Default `0`
113+
* *boost_png* - add (int) boost score to PNG files. Default `0`
114+
* *skip_fetch_errors* - Skip exceptions raised by fetch_image_info(). Exceptions are logged and the image will be skipped. Default `True`
115+
116+
The default parser for BeautifulSoup is html.parser, which is built into Python. We *highly* recommend you install lxml and pass parser="lxml"
117+
to WebpageResolver(). In our testing we found that it was much faster and more accurate.
118+
119+
LOGGING
120+
-------
121+
122+
Use the name "ImageResolver" to configure a logger. Skipped exceptions are logged to this logger at the error level; when debugging is enabled, debug output goes to the same logger.
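A minimal sketch of configuring that logger with the standard ``logging`` module follows. The handler and format string are arbitrary choices for illustration; only the logger name comes from this README.

```python
import logging

# The logger name is the one documented above.
logger = logging.getLogger("ImageResolver")

# Send records to stderr with a timestamp; the format is an arbitrary choice.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)

# ERROR shows skipped fetch errors only; use logging.DEBUG to see debug output too.
logger.setLevel(logging.ERROR)
```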
123+
124+
EXCEPTIONS
125+
----------
126+
127+
**ImageInfoException**
128+
129+
Raised if the image could not be read, or if its type, width or height could not be determined.
130+
By default this exception is logged and the image is skipped; pass "skip_fetch_errors=False" to WebpageResolver to have it raised instead.
131+
132+
**HTTPException**
133+
134+
Raised if the image could not be loaded from the URL.
135+
By default this exception is logged and the image is skipped; pass "skip_fetch_errors=False" to WebpageResolver to have it raised instead.
136+
137+
TODO
138+
-----------------
139+
140+
Still missing the following resolvers:
141+
142+
* ImgurAlbumResolver()
143+
144+
* FlickrResolver()
145+
146+
* OpengraphResolver()
147+
148+
* InstagramResolver()
149+
150+
I have no plans to implement a 9gag resolver.
151+
152+
Need to implement better caching. The plan is to add a configurable cache method so that images seen across sessions can be cached for better performance.
153+
154+
155+
AUTHOR
156+
------
157+
158+
Chris Brown
159+
160+
BUGS
161+
----
162+
163+
Probably. Send us an email or a patch if you find one.
164+
165+
COPYRIGHT / ACKNOWLEDGEMENTS
166+
----------------------------
167+
168+
(c) 2014 Constituent Voice, LLC.
169+
170+
Original idea and basic setup came from Maurice Svay https://github.com/mauricesvay/ImageResolver
171+
172+
Image detection came from the bfg-pages project https://code.google.com/p/bfg-pages/
173+
174+
Reading AdBlock Plus filters forked from https://github.com/wildgarden/abpy
175+
176+
LICENSE
177+
-------
178+
179+
Some of the source libraries are licensed with the BSD license. To avoid license messiness we've chosen to release this software as BSD as well.
180+
The easylist.txt provided by AdBlockPlus is licensed as GPL and it should be updated regularly anyway. For these reasons we have chosen not to
181+
include the file in the package. You can pass it as the "blacklist" or "whitelist" parameter to WebpageResolver.
182+
183+

README.txt

Lines changed: 0 additions & 130 deletions
This file was deleted.

imageresolver/__init__.py

Lines changed: 7 additions & 8 deletions
@@ -1,8 +1,8 @@
11
"""
2-
ImageResolver.py
3-
Copyright 2013 National Write Your Congressman
2+
ImageResolver
3+
Copyright 2014 Constituent Voice
44
5-
ImageResolver.py is a port of the excellent ImageResolver
5+
ImageResolver is a port of the excellent ImageResolver
66
javascript library by Maurice Svay
77
https://github.com/mauricesvay/ImageResolver
88
"""
@@ -38,7 +38,7 @@ class HTTPException(Exception):
3838
class ImageInfoException(Exception):
3939
pass
4040

41-
class ImageResolver():
41+
class ImageResolver(object):
4242

4343
def __init__(self,**kwargs):
4444
self.filters = [];
@@ -52,7 +52,6 @@ def __init__(self,**kwargs):
5252

5353
# read the entire image before trying to get its information
5454
self.read_all = kwargs.get('read_all',False)
55-
5655

5756
if self.debug:
5857
logger.setLevel(logging.DEBUG)
@@ -137,7 +136,7 @@ def resolve(self,url):
137136
if resp:
138137
return resp
139138

140-
class FileExtensionResolver():
139+
class FileExtensionResolver(object):
141140
def resolve(self,url,**kwargs):
142141
logger.debug('Resolving using file extension ' + str(url))
143142
parsed = urlparse(url)
@@ -148,7 +147,7 @@ def resolve(self,url,**kwargs):
148147

149148
return None
150149

151-
class ImgurPageResolver():
150+
class ImgurPageResolver(object):
152151
# works a little different than the JS version.
153152
# it should drop references to galleries and find the image
154153
# could be buggy!
@@ -160,7 +159,7 @@ def resolve(self,url,**kwargs):
160159

161160
return None
162161

163-
class WebpageResolver():
162+
class WebpageResolver(object):
164163
def __init__(self,**kwargs):
165164
self.load_images = kwargs.get('load_images',False)
166165
self.use_js_ruleset = kwargs.get('use_js_ruleset',False)
