This repository was archived by the owner on Jan 11, 2022. It is now read-only.

Add ability to scrape user pages #12

Open
xraymemory wants to merge 6 commits into meetmangukiya:master from xraymemory:patch-1

Conversation

@xraymemory

Refactored the code so that you can specify both tags and users to be scraped. Also fixed some off-by-one errors and added more function documentation.

Owner

@meetmangukiya meetmangukiya left a comment

This is a nice new addition, thank you! Left a detailed review, some are nitpicks here and there, please bear with me. 😅

Comment thread instagram_scraper.py
@@ -1,3 +1,4 @@

Owner

umm... why?

Comment thread instagram_scraper.py
:param short_circuit:
Whether or not to short_circuit total_count loop

Yields url, captions, hashtags, and mentions for provided insta url

Owner

  1. caption*
  2. Move this to the top, in the docstring.
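
One way to read the two points above is sketched below: the summary line ("Yields ...", with singular "caption") opens the docstring, and the `:param` entries follow it, each ending in a colon. The function name `iter_post_fields`, its in-memory `posts` argument, and the regex-based hashtag and mention extraction are assumptions made for illustration; they are not taken from `instagram_scraper.py`.

```python
import re
from typing import Dict, Iterable, Iterator, List, Set, Tuple


def iter_post_fields(
    posts: Iterable[Dict[str, str]],
    existing: Set[str],
) -> Iterator[Tuple[str, str, List[str], List[str]]]:
    """
    Yields url, caption, hashtags, and mentions for each post.

    :param posts:
        Hypothetical in-memory posts, each a dict with 'url' and 'caption'.
    :param existing:
        URLs to skip.
    """
    for post in posts:
        # Skip anything that was already scraped in a previous run.
        if post['url'] in existing:
            continue
        caption = post['caption']
        # Assumed extraction rule: '#word' is a hashtag, '@word' a mention.
        hashtags = re.findall(r'#(\w+)', caption)
        mentions = re.findall(r'@(\w+)', caption)
        yield post['url'], caption, hashtags, mentions
```

Keeping the summary on the first line means tools that show only the start of a docstring still tell the reader what the generator yields.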

Comment thread instagram_scraper.py
:param existing:
URLs to skip
:param short_circuit:
Whether or not to short_circuit total_count loop

Owner

dedent lines 26-33 by 4 spaces

Comment thread instagram_scraper.py
Total number of images to be scraped.
:param existing:
URLs to skip
:param mode

Owner

add a colon after mode

Comment thread instagram_scraper.py
List of users to be scraped
:param total_count:
total number of images to be scraped
:param should_continue

Owner

add colon after should_continue

Comment thread instagram_scraper.py
existing_links.add(row[1])
start = i + 1
_single_tag_processing(tag, total_count, existing_links, start)
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')

Owner

This becomes incorrect, since we are downloading as f'{count}.jpg', which is one less than file_index. Replace count with file_index; it is the better variable name.
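
A minimal sketch of the fix being asked for, assuming the intent is that one index drives both the file name on disk and the log message. The helper `plan_downloads` is hypothetical and performs no network or file I/O; it only shows that deriving the path and the message from the same `file_index` keeps them in agreement.

```python
from typing import Iterable, List, Tuple


def plan_downloads(
    urls: Iterable[str],
    target: str,
    start: int = 0,
) -> List[Tuple[str, str, str]]:
    """
    Returns (url, path, message) triples for the given urls.

    Hypothetical helper: nothing is downloaded, the triples only show
    how a single index can name the file and appear in the log line.
    """
    plan = []
    for file_index, url in enumerate(urls, start=start):
        # The same file_index is used for the path and for the message,
        # so the log can never be off by one from the file on disk.
        path = f'data/{target}/{file_index}.jpg'
        message = (
            f'[{target}] downloaded {url} as {file_index}.jpg '
            f'in data/{target}'
        )
        plan.append((url, path, message))
    return plan
```

Passing `start` mirrors the way the diff resumes numbering after the rows already recorded for a target.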

Comment thread instagram_scraper.py
try:
req = requests.get(url)
- with open(f'data/{tag}/{count}.jpg', 'wb') as img:
+ with open(f'data/{target}/{count}.jpg', 'wb') as img:

Owner

We want users to be able to distinguish between user photos and tag photos: if I scrape @instagram, I might mistake the results for images scraped from the instagram tag. So, mode-specific data directories. :)
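
One possible layout for mode-specific directories is sketched below. The helper name `data_dir`, the `data/<mode>/<target>` layout, and the mode names `'tags'` and `'users'` (borrowed from the keys of the `targets` dict in the diff) are assumptions, not something this PR settles.

```python
from pathlib import PurePosixPath

# Assumed mode names, taken from the keys of the `targets` dict in the diff.
VALID_MODES = ('tags', 'users')


def data_dir(mode: str, target: str, root: str = 'data') -> PurePosixPath:
    """
    Returns the directory in which images for `target` would be stored.

    :param mode:
        Either 'tags' or 'users'.
    :param target:
        The tag or username being scraped.
    :param root:
        Top-level data directory.
    """
    if mode not in VALID_MODES:
        raise ValueError(f'unknown mode: {mode!r}')
    # The mode becomes a path component, so a user and a tag that share
    # a name end up in different directories.
    return PurePosixPath(root) / mode / target
```

With this layout, scraping the user `instagram` would write under `data/users/instagram`, while the tag of the same name would write under `data/tags/instagram`.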

Comment thread instagram_scraper.py
print(f'[{target}] downloaded {url} as {file_index}.jpg in data/{target}')

targets = {'tags': tags, 'users': users}
for mode,lists in targets.items():

Owner

space after ,

Comment thread instagram_scraper.py

Scrapes user and hashtag images from Instagram
"""
def _single_input_processing(target: str, total_count: int, existing_links: set, start: int, mode: str='tag'):

Owner

Rename this, this is no longer single input processing.

Comment thread instagram_scraper.py
for i, row in enumerate(reader):
existing_links.add(row[1])
start = i + 1
_single_input_processing(target, total_count, existing_links, start, mode=mode)

Owner

Account for the rename here too.


2 participants