Resolved: How to store unique results from scraping an RSS feed?


I have a script that scrapes Google Alerts using requests and BeautifulSoup and stores the results in a CSV file using df.to_csv():
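import requests
from bs4 import BeautifulSoup
import pandas as pd

name = 'galert'
url = ''

# Fetch the feed and parse it (parser assumed; 'xml' needs lxml installed)
r = requests.get(url)
soup = BeautifulSoup(r.text, 'xml')

# Collect one dict per feed entry
output = []
for entry in soup.find_all('entry'):
    item = {
        'Title': entry.find('title', {'type': 'html'}).text,
        'Pubdate': entry.find('published').text,
        'Content': entry.find('content').text,
        'Link': entry.find('link')['href'],
    }
    output.append(item)

df = pd.DataFrame(output)
df.to_csv('google_alert.csv', index=False)
print('Saved to google_alert.csv')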
How do I run my script multiple times and keep adding new results to the same CSV file without repeating any results?
The CSV has the following columns: Title, Pubdate, Content and Link. I think the rows can be kept unique by checking the Link value. But how do I add this to the flow of the scraper?


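You can use drop_duplicates: load the existing CSV, concatenate it with the new results, and drop the rows whose Link already appears. Note that pd.read_csv raises FileNotFoundError if google_alert.csv does not exist yet, so this assumes the file was created by an earlier run.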
# Load your old entries
df1 = pd.read_csv('google_alert.csv')

# Do stuff here with requests and bs4
df2 = pd.DataFrame(output)

df = pd.concat([df1, df2]).drop_duplicates('Link')
df.to_csv('google_alert.csv', index=False)
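
If you also want the very first run to work, when the CSV does not exist yet, the same idea can be wrapped in a small helper. This is only a sketch: append_unique and its parameters are names I made up for illustration, and it assumes each scraped row is a dict with a 'Link' key, as in the question.

```python
import os
import pandas as pd

def append_unique(new_rows, csv_path, key='Link'):
    """Merge new_rows into csv_path, keeping one row per value of `key`.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # Load previous results if the file exists; otherwise start empty
    old_df = pd.read_csv(csv_path) if os.path.exists(csv_path) else pd.DataFrame()
    new_df = pd.DataFrame(new_rows)
    if new_df.empty:
        return old_df  # nothing new to add, leave the file untouched

    # Skip empty frames so concat only sees real data
    frames = [f for f in (old_df, new_df) if not f.empty]
    combined = pd.concat(frames, ignore_index=True)

    # keep='first' retains the already saved row when a link shows up again
    combined = combined.drop_duplicates(subset=key, keep='first')
    combined.to_csv(csv_path, index=False)
    return combined
```

In the scraper you would then replace the final DataFrame/to_csv lines with append_unique(output, 'google_alert.csv'). Because old rows come first in the concat, a link that was saved earlier keeps its original Title, Pubdate and Content.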

