|
| 1 | +============= |
| 2 | +ImageResolver |
| 3 | +============= |
| 4 | + |
| 5 | +A Python clone of ImageResolver for finding significant images in HTML content. |
| 6 | +See the excellent JS version at: https://github.com/mauricesvay/ImageResolver |
| 7 | + |
| 8 | +USAGE |
| 9 | +----- |
| 10 | + |
| 11 | +:: |
| 12 | + |
| 13 | +    import imageresolver |
| 14 | +    import sys |
| 15 | + |
| 16 | +    try: |
| 17 | +        i = imageresolver.ImageResolver() |
| 18 | +        i.register(imageresolver.FileExtensionResolver()) |
| 19 | +        i.register(imageresolver.ImgurPageResolver()) |
| 20 | +        i.register(imageresolver.WebpageResolver(load_images=True, parser='lxml', blacklist='easylist.txt')) |
| 21 | +        url = sys.argv[1] |
| 22 | + |
| 23 | +        print(i.resolve(url)) |
| 24 | +    except Exception: |
| 25 | +        print("An error occurred") |
| 26 | + |
| 27 | +Differences From the Javascript Version |
| 28 | +--------------------------------------- |
| 29 | + |
| 30 | +* Methods return values instead of invoking callbacks |
| 31 | + |
| 32 | +* WebpageResolver has lots of new options (see below) |
| 33 | + |
| 34 | +* Added some debugging features |
| 35 | + |
| 36 | +* Exceptions are raised instead of being passed to an error callback |
| 37 | + |
| 38 | +WebpageResolver Additions |
| 39 | +------------------------- |
| 40 | + |
| 41 | +* Rules syntax is now based on AdBlockPlus filters (https://adblockplus.org/en/filters) |
| 42 | + |
| 43 | +* New rules can be added without writing a resolver |
| 44 | + |
| 45 | +* Image sources can be blacklisted or whitelisted |
| 46 | + |
| 47 | +* Loads as little of the image as possible when fetching image info. Stops downloading once the dimensions are found or a configurable limit is reached. |
| 48 | + |
| 49 | +* The original rules from the JS version are still implemented (see options). |
| 50 | + |
| 51 | +ImageResolver() METHODS |
| 52 | +----------------------- |
| 53 | + |
| 54 | +**__init__** *(\*\*kwargs)* |
| 55 | + |
| 56 | +Keyword options |
| 57 | + |
| 58 | + * *max_read_size* - set to the maximum number of bytes to read when looking for the width and height of an image. Default `10240` |
| 59 | + * *chunk_size* - set to the chunk size to read. Default `1024` |
| 60 | + * *read_all* - set to read the entire image before detecting its info. This option overrides max_read_size. Default `False` |
| 61 | + * *debug* - set to enable debugging output (logger="ImageResolver"). Default `False` |
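
The interplay between *max_read_size* and *chunk_size* can be sketched as follows. This is only an illustration under stated assumptions, not the module's implementation: `png_size` and `probe_dimensions` are hypothetical helpers, only PNG headers are handled, and an in-memory stream stands in for an HTTP response.

```python
import io
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(data):
    """Return (width, height) from a PNG header, or None if it cannot be read yet."""
    # Layout: signature (8 bytes), chunk length (4), b"IHDR" (4), width (4), height (4).
    if len(data) < 24 or not data.startswith(PNG_SIGNATURE) or data[12:16] != b"IHDR":
        return None
    return struct.unpack(">II", data[16:24])


def probe_dimensions(stream, chunk_size=1024, max_read_size=10240):
    """Read chunk by chunk, stopping once a size is found or the limit is reached."""
    buf = b""
    while len(buf) < max_read_size:
        chunk = stream.read(chunk_size)
        if not chunk:  # end of stream
            break
        buf += chunk
        size = png_size(buf)
        if size is not None:
            return size, len(buf)
    return None, len(buf)


# Hypothetical 640x480 PNG header followed by filler bytes.
header = PNG_SIGNATURE + struct.pack(">I", 13) + b"IHDR" + struct.pack(">II", 640, 480)
fake_image = io.BytesIO(header + b"\x00" * 50000)
size, bytes_read = probe_dimensions(fake_image, chunk_size=16)
print(size, bytes_read)
```

With a 16-byte chunk size the sketch stops after two reads, well before the 50 KB of filler is consumed, which is the behavior the options above are meant to control.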
| 62 | + |
| 63 | +**fetch** *(string url)* |
| 64 | + |
| 65 | +Fetches a URL and returns the response data. |
| 66 | + |
| 67 | +**fetch_image_info** *(string url)* |
| 68 | + |
| 69 | +Fetches an image URL and examines the resulting image. Returns a tuple consisting of the detected file extension, the width and the height of the image. |
| 70 | + |
| 71 | +**register** *(instance filter)* |
| 72 | + |
| 73 | +Registers a filter used to examine a URL. The filter argument must be an instance of a class that has a `resolve()` method. `resolve()` must accept a string URL and must return a URL or `None`. |
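
As a rough sketch of the contract described above, a custom filter only needs a `resolve()` method that returns a URL or `None`. The class name `ExampleCdnResolver` and the host `images.example.com` are made up for illustration and are not part of this module.

```python
import re


class ExampleCdnResolver(object):
    """Hypothetical filter: accepts URLs served from an example image host."""

    PATTERN = re.compile(r"^https?://images\.example\.com/.+")

    def resolve(self, url):
        # Return the URL when this filter recognizes it, otherwise None.
        if self.PATTERN.match(url):
            return url
        return None


resolver = ExampleCdnResolver()
print(resolver.resolve("http://images.example.com/cat.png"))
print(resolver.resolve("http://example.org/page.html"))
```

An instance of such a class could then be handed to `register()` in the same way as the built-in resolvers in the USAGE section.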
| 74 | + |
| 75 | +**resolve** *(string url)* |
| 76 | + |
| 77 | +Loops through the registered filters until one of them resolves a URL. Returns `None` if no filter resolves one. |
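
The loop described above might look roughly like this. `FilterChain`, `NeverMatches` and `AlwaysMatches` are hypothetical stand-ins written for this example, and trying the filters in registration order is an assumption rather than something the module guarantees.

```python
class FilterChain(object):
    """Hypothetical stand-in that mimics a register/resolve loop."""

    def __init__(self):
        self.filters = []

    def register(self, f):
        self.filters.append(f)

    def resolve(self, url):
        # Ask each filter in turn; the first non-None answer wins.
        for f in self.filters:
            result = f.resolve(url)
            if result is not None:
                return result
        return None


class NeverMatches(object):
    def resolve(self, url):
        return None


class AlwaysMatches(object):
    def resolve(self, url):
        return url


chain = FilterChain()
chain.register(NeverMatches())
chain.register(AlwaysMatches())
print(chain.resolve("http://example.com/a.jpg"))
```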
| 78 | + |
| 79 | + |
| 80 | +FileExtensionResolver() METHODS |
| 81 | +------------------------------- |
| 82 | + |
| 83 | +**resolve** *(string url)* |
| 84 | + |
| 85 | +Returns the URL if its extension matches that of a possible image. |
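
An extension check of this kind can be approximated in a few lines. The sketch below is written for Python 3; the function name `resolve_by_extension` and the extension list are assumptions for illustration, and the list the module actually uses may differ.

```python
import os.path
from urllib.parse import urlparse

# Assumed set of image extensions; not taken from the module.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def resolve_by_extension(url):
    """Return the URL if its path ends in a known image extension, else None."""
    path = urlparse(url).path  # drops the query string and fragment
    ext = os.path.splitext(path)[1].lower()
    return url if ext in IMAGE_EXTENSIONS else None


print(resolve_by_extension("http://example.com/photo.JPG?size=large"))
print(resolve_by_extension("http://example.com/index.html"))
```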
| 86 | + |
| 87 | +ImgurPageResolver() METHODS |
| 88 | +--------------------------- |
| 89 | + |
| 90 | +**resolve** *(string url)* |
| 91 | + |
| 92 | +Returns an Imgur image URL if `url` matches the pattern of an Imgur page. |
| 93 | + |
| 94 | +WebpageResolver() METHODS |
| 95 | +------------------------- |
| 96 | + |
| 97 | +The workhorse of this module. Our uses revolve mostly around this filter, so it is the |
| 98 | +most feature-complete and best-tested resolver. |
| 99 | + |
| 100 | +**__init__** *(\*\*kwargs)* |
| 101 | + |
| 102 | +Initialize the class with options: |
| 103 | + |
| 104 | + * *load_image* - set to true to load the first 1k of images whose size is not set in HTML. Default `False` |
| 105 | + * *use_js_ruleset* - set to true to use the original rules from the Javascript version. Default `False` |
| 106 | + * *use_adblock_filters* - set to false to disable adblock filters. Default `True` |
| 107 | + * *parser* - set to a BeautifulSoup compatible parser (lxml is recommended). Default `html.parser` |
| 108 | + * *blacklist* - set to a file containing AdBlockPlus style filters used to lower an image's score. Default `blacklist.txt` |
| 109 | + * *whitelist* - set to a file containing AdBlockPlus style filters used to raise an image's score. Default `whitelist.txt` |
| 110 | + * *significant_surface* - minimum surface (width x height) an image needs in order to earn additional score |
| 111 | + * *boost_jpeg* - add (int) boost score to JPEG files. Default `1` |
| 112 | + * *boost_gif* - add (int) boost score to GIF files. Default `0` |
| 113 | + * *boost_png* - add (int) boost score to PNG files. Default `0` |
| 114 | + * *skip_fetch_errors* - Skip exceptions raised by fetch_image_info(). Exceptions are logged and the image will be skipped. Default `True` |
| 115 | + |
| 116 | +The default parser for BeautifulSoup is html.parser, which is built into Python. We *highly* recommend that you install lxml and pass parser="lxml" |
| 117 | +to WebpageResolver(). In our testing we found it to be much faster and more accurate. |
| 118 | + |
| 119 | +LOGGING |
| 120 | +------- |
| 121 | + |
| 122 | +Use the name "ImageResolver" to configure a logger. Skipped exceptions are logged to this logger at error level; when debugging is enabled, debugging output is sent to it as well. |
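
A minimal way to configure that logger with the standard `logging` module is sketched below. Only the logger name comes from this README; the handler and the format string are arbitrary choices made for the example.

```python
import logging

# The logger name is the one documented above.
logger = logging.getLogger("ImageResolver")
logger.setLevel(logging.DEBUG)

# Send records to stderr with a timestamp; any other handler would work too.
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```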
| 123 | + |
| 124 | +EXCEPTIONS |
| 125 | +---------- |
| 126 | + |
| 127 | +**ImageInfoException** |
| 128 | + |
| 129 | +Raised if the image could not be read, or if its type, width or height could not be determined. |
| 130 | +By default this exception is logged and skipped; pass "skip_fetch_errors=False" to WebpageResolver to have it raised instead. |
| 131 | + |
| 132 | +**HTTPException** |
| 133 | + |
| 134 | +Raised if the image could not be loaded from the URL. |
| 135 | +By default this exception is logged and skipped; pass "skip_fetch_errors=False" to WebpageResolver to have it raised instead. |
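
The log-and-skip behavior described for both exceptions can be illustrated with a self-contained sketch. Everything here is a stand-in: the local `ImageInfoException` class only mirrors the name of the module's exception, and `fake_fetch_image_info` and `collect_info` are hypothetical helpers, not the module's code.

```python
import logging

log = logging.getLogger("ImageResolver")


class ImageInfoException(Exception):
    """Local stand-in for the module's exception of the same name."""


def fake_fetch_image_info(url):
    # Hypothetical fetcher: fails for URLs containing "broken".
    if "broken" in url:
        raise ImageInfoException("could not read %s" % url)
    return ("png", 100, 50)


def collect_info(urls, skip_fetch_errors=True):
    """Gather image info, logging and skipping failures unless told to raise."""
    results = {}
    for url in urls:
        try:
            results[url] = fake_fetch_image_info(url)
        except ImageInfoException as exc:
            if not skip_fetch_errors:
                raise
            log.error("skipping %s: %s", url, exc)
    return results


urls = ["http://example.com/ok.png", "http://example.com/broken.png"]
print(collect_info(urls))
```

Calling `collect_info(urls, skip_fetch_errors=False)` in this sketch re-raises the exception instead, which corresponds to turning the option off.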
| 136 | + |
| 137 | +TODO |
| 138 | +----------------- |
| 139 | + |
| 140 | +Still missing the following resolvers: |
| 141 | + |
| 142 | +* ImgurAlbumResolver() |
| 143 | + |
| 144 | +* FlickrResolver() |
| 145 | + |
| 146 | +* OpengraphResolver() |
| 147 | + |
| 148 | +* InstagramResolver() |
| 149 | + |
| 150 | +I have no plans to implement a 9gag resolver. |
| 151 | + |
| 152 | +Better caching still needs to be implemented. The plan is to add a configurable cache method so that images seen across sessions can be cached for better performance. |
| 153 | + |
| 154 | + |
| 155 | +AUTHOR |
| 156 | +------ |
| 157 | + |
| 158 | +Chris Brown |
| 159 | + |
| 160 | +BUGS |
| 161 | +---- |
| 162 | + |
| 163 | +Probably. Send us an email or a patch if you find one. |
| 164 | + |
| 165 | +COPYRIGHT / ACKNOWLEDGEMENTS |
| 166 | +---------------------------- |
| 167 | + |
| 168 | +(c) 2014 Constituent Voice, LLC. |
| 169 | + |
| 170 | +Original idea and basic setup came from Maurice Svay https://github.com/mauricesvay/ImageResolver |
| 171 | + |
| 172 | +Image detection came from the bfg-pages project https://code.google.com/p/bfg-pages/ |
| 173 | + |
| 174 | +Reading AdBlock Plus filters forked from https://github.com/wildgarden/abpy |
| 175 | + |
| 176 | +LICENSE |
| 177 | +------- |
| 178 | + |
| 179 | +Some of the source libraries are licensed under the BSD license. To avoid license messiness we've chosen to release this software under the BSD license as well. |
| 180 | +The easylist.txt provided by AdBlockPlus is licensed under the GPL and should be updated regularly anyway. For these reasons we have chosen not to |
| 181 | +include the file in the package. You can pass it as the "blacklist" or "whitelist" parameter to WebpageResolver. |
| 182 | + |
| 183 | + |