Question:
I have a script which scrapes Google Alerts using requests and BeautifulSoup and stores the results in a CSV using df.to_csv():
import requests
from bs4 import BeautifulSoup
import pandas as pd
name = 'galert'
url = 'https://www.google.co.in/alerts/feeds/'

# Fetch the alert feed and parse it as XML (the 'xml' parser requires lxml)
resp = requests.get(url)
soup = BeautifulSoup(resp.content, 'xml')

output = []
for entry in soup.find_all('entry'):
    item = {
        'Title': entry.find('title', {'type': 'html'}).text,
        'Pubdate': entry.find('published').text,
        'Content': entry.find('content').text,
        'Link': entry.find('link')['href']
    }
    output.append(item)
df = pd.DataFrame(output)
df.to_csv('google_alert.csv',index=False)
print('Saved to google_alert.csv')
How do I run my scraper.py multiple times and keep appending new results to the same CSV file without repeating any results? The CSV has the following columns: Title, Pubdate, Content and Link. I think entries can be treated as unique if the Link value is checked, but how do I add that check into the flow of this scraper?
Answer:
You can use drop_duplicates:
# Load your old entries
df1 = pd.read_csv('google_alert.csv')
# Do stuff here with requests and bs4
...
df2 = pd.DataFrame(output)
# Combine old and new rows, keeping only the first occurrence of each Link
df = pd.concat([df1, df2]).drop_duplicates('Link')
df.to_csv('google_alert.csv', index=False)
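Note that pd.read_csv will raise FileNotFoundError the very first time you run the script, before google_alert.csv exists. A minimal sketch of how you could guard against that with an existence check (the os.path.exists guard and variable names here are just illustrative, not part of the original answer):
import os
import pandas as pd

csv_path = 'google_alert.csv'  # output file name from the question

# Do stuff here with requests and bs4 to build `output`, as in the question
...
df_new = pd.DataFrame(output)

if os.path.exists(csv_path):
    # Later runs: merge with previously saved rows and dedupe on Link
    df_old = pd.read_csv(csv_path)
    df_all = pd.concat([df_old, df_new]).drop_duplicates('Link')
else:
    # First run: nothing saved yet, just dedupe the fresh batch
    df_all = df_new.drop_duplicates('Link')

df_all.to_csv(csv_path, index=False)
Since drop_duplicates keeps the first occurrence by default and the old DataFrame comes first in the concat, rows already saved in the CSV take precedence over freshly scraped duplicates.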