Add ability to scrape user pages #12
meetmangukiya left a comment
This is a nice new addition, thank you! I left a detailed review; some comments are nitpicks, so please bear with me. 😅
| :param short_circuit: | ||
| Whether or not to short_circuit total_count loop | ||
| Yields url, captions, hashtags, and mentions for provided insta url |
- Typo: "captions" should be "caption".
- Move this line to the top of the docstring.
| :param existing: | ||
| URLs to skip | ||
| :param short_circuit: | ||
| Whether or not to short_circuit total_count loop |
Dedent lines 26-33 by 4 spaces.
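To illustrate the docstring comments above, here is a rough sketch of the layout I have in mind: summary line first, then every `:param name:` field at the same indentation with its trailing colon. The function name `scrape_posts_sketch`, its signature, and the `url` parameter description are my assumptions, not code from this PR; the other descriptions are copied from the diff.

```python
import inspect
from typing import Iterator, List, Set, Tuple


def scrape_posts_sketch(
    url: str,
    total_count: int,
    existing: Set[str],
    short_circuit: bool = True,
) -> Iterator[Tuple[str, str, List[str], List[str]]]:
    """
    Yields url, caption, hashtags, and mentions for provided insta url

    :param url:
        Instagram URL to read posts from (hypothetical description).
    :param total_count:
        Total number of images to be scraped.
    :param existing:
        URLs to skip
    :param short_circuit:
        Whether or not to short_circuit total_count loop
    """
    # Placeholder body: a real implementation would fetch and parse the page.
    return iter(())


# The cleaned docstring starts with the summary line.
print(inspect.cleandoc(scrape_posts_sketch.__doc__).splitlines()[0])
```

With this layout, tools that read Sphinx-style field lists can pick up each parameter, and the summary is the first thing a reader sees.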
| Total number of images to be scraped. | ||
| :param existing: | ||
| URLs to skip | ||
| :param mode |
| List of users to be scraped | ||
| :param total_count: | ||
| total number of images to be scraped | ||
| :param should_continue |
Add a colon after should_continue.
| existing_links.add(row[1]) | ||
| start = i + 1 | ||
| _single_tag_processing(tag, total_count, existing_links, start) | ||
| print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}') |
This message becomes incorrect, since the file is saved as f'{count}.jpg' and count is one less than file_index. Replace count with file_index everywhere; it is also the better variable name.
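One way to keep the file name and the log message from drifting apart is to derive both from the same path. A minimal sketch, assuming hypothetical helpers `save_image` and `download_message` (neither exists in this PR) and a temporary directory in place of the real data folder:

```python
import tempfile
from pathlib import Path


def save_image(content: bytes, target: str, file_index: int, base_dir: Path) -> Path:
    """Write image bytes to <base_dir>/<target>/<file_index>.jpg and return the path."""
    out_dir = base_dir / target
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{file_index}.jpg"
    path.write_bytes(content)
    return path


def download_message(target: str, url: str, path: Path) -> str:
    """Build the log line from the path that was actually written."""
    return f"[{target}] downloaded {url} as {path.name} in {path.parent}"


# Usage with fake bytes, so no network request is needed.
with tempfile.TemporaryDirectory() as tmp:
    saved = save_image(b"fake image bytes", "sunset", 3, Path(tmp))
    print(download_message("sunset", "https://example.com/a.jpg", saved))
```

Because the message is built from the returned path, it cannot report a different index than the one used on disk.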
| try: | ||
| req = requests.get(url) | ||
| with open(f'data/{tag}/{count}.jpg', 'wb') as img: | ||
| with open(f'data/{target}/{count}.jpg', 'wb') as img: |
Users should be able to distinguish user photos from tag photos: if I scrape @instagram, I might mistake the results for images scraped from the instagram tag. So, please use mode-specific data directories. :)
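Something along these lines would do. This is only a sketch: the helper name `data_dir`, the `data/<mode>/<target>` layout, and the mode names "tags" and "users" (taken from the keys of the targets dict in the diff) are my assumptions.

```python
from pathlib import Path


def data_dir(mode: str, target: str, root: Path = Path("data")) -> Path:
    """Return a per-mode directory, e.g. data/users/instagram."""
    if mode not in {"tags", "users"}:
        raise ValueError(f"unknown mode: {mode!r}")
    return root / mode / target


# The same name no longer collides across modes.
print(data_dir("tags", "instagram").as_posix())
print(data_dir("users", "instagram").as_posix())
```

The caller would still need to create the directory before writing, for example with `mkdir(parents=True, exist_ok=True)`.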
| print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}') | ||
| targets = {'tags': tags, 'users': users} | ||
| for mode,lists in targets.items(): |
| Scrapes user and hashtag images from Instagram | ||
| """ | ||
| def _single_input_processing(target: str, total_count: int, existing_links: set, start: int, mode: str='tag'): |
Rename this function; it no longer does single-input processing.
| for i, row in enumerate(reader): | ||
| existing_links.add(row[1]) | ||
| start = i + 1 | ||
| _single_input_processing(target, total_count, existing_links, start, mode=mode) |
Account for the rename here too.
Refactored the code so that you can specify both tags and users to be scraped. Also fixed some off-by-one errors and added more function documentation.
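For reference, the overall flow described above can be sketched roughly as follows. This is not the code in this PR: `load_existing`, `run_all`, and the `process` callback are hypothetical names, and the sketch assumes the URL is stored in column 1 of the CSV, as the `row[1]` lookup in the diff suggests.

```python
import csv
import io
from typing import Callable, Dict, Iterable, List, Set, TextIO, Tuple


def load_existing(csv_file: TextIO) -> Tuple[Set[str], int]:
    """Collect already-scraped URLs (column 1) and the next row index to use."""
    existing_links: Set[str] = set()
    start = 0  # stays 0 when the file has no rows
    for i, row in enumerate(csv.reader(csv_file)):
        existing_links.add(row[1])
        start = i + 1
    return existing_links, start


def run_all(
    tags: Iterable[str],
    users: Iterable[str],
    process: Callable[[str, str], None],
) -> None:
    """Call process(target, mode) for every tag and every user."""
    targets: Dict[str, Iterable[str]] = {"tags": tags, "users": users}
    for mode, names in targets.items():
        for target in names:
            process(target, mode)


# Usage with in-memory data instead of real files or network calls.
links, start = load_existing(
    io.StringIO("0,https://example.com/a\n1,https://example.com/b\n")
)
calls: List[Tuple[str, str]] = []
run_all(["sunset"], ["instagram"], lambda target, mode: calls.append((target, mode)))
print(start, sorted(links), calls)
```

Initializing `start` before the loop avoids an undefined value when the CSV is empty, which is one place an off-by-one or unbound-variable error can creep in.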