Scraping News and Articles From Public APIs with Python

Whether you are a data scientist, programmer or AI specialist, you can surely put a huge number of news articles to good use. Getting those articles can be challenging though, as you will have to jump through quite a few hoops to get to the actual data - finding the right news sources, exploring their APIs, figuring out how to authenticate against them and finally scraping the data. That's a lot of work and no fun.

So, to save you some time and get you started, here's a list of public news APIs that I was able to find, with an explanation of how to authenticate against them, how to query them and, most importantly, examples of how to get all the data you need from them!

NY Times

The first and, in my opinion, the best source of data is The New York Times. To start using its API you need to create an account at https://developer.nytimes.com/accounts/create and an application at https://developer.nytimes.com/my-apps/new-app. When creating the application you get to choose which APIs to activate - I recommend activating at least the Most Popular, Article Search, Top Stories and Archive APIs. When your application is created you will be presented with the key which you will use to interact with all the selected APIs, so copy it and let's start querying!

The simplest query we can do with the NY Times API is to look up the current top stories:


import requests
import os
from pprint import pprint

apikey = os.getenv('NYTIMES_APIKEY', '...')

# Top Stories:
# https://developer.nytimes.com/docs/top-stories-product/1/overview
section = "science"
query_url = f"https://api.nytimes.com/svc/topstories/v2/{section}.json?api-key={apikey}"

r = requests.get(query_url)
pprint(r.json())

The snippet above is very straightforward. We run a GET request against the topstories/v2 endpoint, supplying the section name and our API key. The section in this case is science, but NY Times provides a lot of other options here, e.g. fashion, health, sports or theater. The full list can be found in the API documentation. This specific request would produce a response that looks something like this:


{ 'last_updated': '2020-08-09T08:07:44-04:00',
 'num_results': 25,
 'results': [{'abstract': 'New Zealand marked 100 days with no new reported '
                          'cases of local coronavirus transmission. France '
                          'will require people to wear masks in crowded '
                          'outdoor areas.',
              'byline': '',
              'created_date': '2020-08-09T08:00:12-04:00',
              'item_type': 'Article',
              'multimedia': [{'caption': '',
                              'copyright': 'The New York Times',
                              'format': 'superJumbo',
                              'height': 1080,
                              'subtype': 'photo',
                              'type': 'image',
                              'url': 'https://static01.nyt.com/images/2020/08/03/us/us-briefing-promo-image-print/us-briefing-promo-image-superJumbo.jpg',
                              'width': 1920},
                             ],
              'published_date': '2020-08-09T08:00:12-04:00',
              'section': 'world',
              'short_url': 'https://nyti.ms/3gH9NXP',
              'title': 'Coronavirus Live Updates: DeWine Stresses Tests’ '
                       'Value, Even After His False Positive',
              'uri': 'nyt://article/27dd9f30-ad63-52fe-95ab-1eba3d6a553b',
              'url': 'https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html'},
             ]
 }
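Most of the time you only need a few fields out of a response like this. Here is a minimal sketch, assuming the response has the shape shown above; the helper name `summarize_top_stories` and the trimmed-down `sample` dict are my own illustrations, not part of the NY Times API:

```python
def summarize_top_stories(payload):
    """Return (title, section, url) tuples from a Top Stories-style response dict."""
    stories = []
    for item in payload.get("results", []):
        # Skip entries without a title or URL rather than failing on them.
        if item.get("title") and item.get("url"):
            stories.append((item["title"], item.get("section", ""), item["url"]))
    return stories


# Trimmed-down sample shaped like the response shown above.
sample = {
    "num_results": 1,
    "results": [{
        "title": "Coronavirus Live Updates: DeWine Stresses Tests’ Value, Even After His False Positive",
        "section": "world",
        "url": "https://www.nytimes.com/2020/08/09/world/coronavirus-covid-19.html",
    }],
}

for title, section, url in summarize_top_stories(sample):
    print(f"[{section}] {title}\n  {url}")
```

With a live request you would pass `r.json()` instead of `sample`.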

Next, and probably the most useful endpoint when you are trying to get a specific set of data, is the article search endpoint:


# Article Search:
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=<QUERY>&api-key=<APIKEY>
# Use - https://developer.nytimes.com/docs/articlesearch-product/1/routes/articlesearch.json/get to explore API

query = "politics"
begin_date = "20200701"  # YYYYMMDD
filter_query = 'body:("Trump") AND glocations:("WASHINGTON")'  # Lucene syntax: http://www.lucenetutorial.com/lucene-query-syntax.html
page = "0"  # <0-100>
sort = "relevance"  # newest, oldest
query_url = f"https://api.nytimes.com/svc/search/v2/articlesearch.json?" \
            f"q={query}" \
            f"&api-key={apikey}" \
            f"&begin_date={begin_date}" \
            f"&fq={filter_query}" \
            f"&page={page}" \
            f"&sort={sort}"

r = requests.get(query_url)
pprint(r.json())

This endpoint features lots of filtering options. The only mandatory field is q (query), which is the search term. Beyond that you can mix and match the filter query, date range (begin_date, end_date), page number, sort order and facet fields. The filter query (fq) is an interesting one, as it allows the use of Lucene query syntax, which can be used to create complex filters with logical operators (AND, OR), negations or wildcards. A nice tutorial can be found at http://www.lucenetutorial.com/lucene-query-syntax.html.

An example response for the above query might look like this (some fields were removed for clarity):


{'response': {'docs': [{'_id': 'nyt://article/0bf06be1-6699-527f-acb0-09fdd8abb6f6',
                        'abstract': 'The president sidestepped Congress when it became clear that his nominee for a '
                                    'top Defense Department position would not win Senate approval.',
                        'byline': {'original': 'By Helene Cooper'},
                        'document_type': 'article',
                        'headline': {'main': 'Trump Puts Pentagon in Political Crossfire With Tata Appointment',
                                     'print_headline': 'Bypassing Congress to Appoint Ally, Trump Puts Pentagon in Political Crossfire'},
                        'keywords': [{'major': 'N', 'name': 'subject', 'rank': 1,
                                      'value': 'United States Politics and Government'},
                                     {'major': 'N', 'name': 'subject', 'rank': 2,
                                      'value': 'Appointments and Executive Changes'},
                                     {'major': 'N', 'name': 'subject', 'rank': 3,
                                      'value': 'Presidential Election of 2020'}],
                        'lead_paragraph': 'WASHINGTON — In making an end run around Congress to appoint Anthony J. Tata, a retired brigadier '
                                          'general with a history of Islamophobic and other inflammatory views, to a top Defense Department '
                                          'post, President Trump has once again put the military exactly where it does not want to be: in '
                                          'the middle of a political battle that could hurt bipartisan support for the Pentagon.',
                        'multimedia': [],
                        'news_desk': 'Washington',
                        'pub_date': '2020-08-03T21:19:00+0000',
                        'section_name': 'U.S.',
                        'source': 'The New York Times',
                        'subsection_name': 'Politics',
                        'type_of_material': 'News',
                        'uri': 'nyt://article/0bf06be1-6699-527f-acb0-09fdd8abb6f6',
                        'web_url': 'https://www.nytimes.com/2020/08/03/us/politics/tata-pentagon.html',
                        'word_count': 927}]}}
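Filter queries contain spaces, quotes and parentheses, so it can be safer to percent-encode the parameters than to paste them into the URL by hand. A minimal sketch using only the standard library follows; the helper `build_search_url`, the `YOUR_KEY` placeholder and the filter string are illustrative assumptions, and the parameter names are the ones used in the article search snippet earlier:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_search_url(apikey, query, filter_query=None, begin_date=None,
                     page=0, sort="relevance"):
    """Build an Article Search URL with every parameter percent-encoded."""
    params = {"q": query, "api-key": apikey, "page": page, "sort": sort}
    # Optional parameters are only sent when they are set.
    if filter_query:
        params["fq"] = filter_query
    if begin_date:
        params["begin_date"] = begin_date
    return f"{BASE_URL}?{urlencode(params)}"


url = build_search_url(
    apikey="YOUR_KEY",  # placeholder, not a real key
    query="politics",
    filter_query='body:("Trump") AND glocations:("WASHINGTON")',
    begin_date="20200701",
)
print(url)
```

The resulting string can be passed to `requests.get` as before. Alternatively, `requests.get(BASE_URL, params=...)` does the same encoding for you.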

The last NY Times endpoint that I will show here is the Archive API, which returns a list of articles for a given month going back all the way to 1851! This can be very useful if you need bulk data and don't really need to search for specific terms.


# Archive Search
# https://developer.nytimes.com/docs/archive-product/1/overview

year = "1852"  # <1851 - 2020>
month = "6"  # <1 - 12>
query_url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json?api-key={apikey}"

r = requests.get(query_url)
pprint(r.json())

The query above retrieves all articles from June of 1852, and from the result below we can see that even though we asked for really old articles, we still got 1888 hits. That said, most of these lack much of the useful data like keywords, word counts, author, etc., so you are probably better off searching for slightly more recent articles.


{'response': {
        'meta': {'hits': 1888},
        'docs': [{'_id': 'nyt://article/fada2905-0108-54a9-8729-ae9cda8b9528',
                        'byline': {'organization': None, 'original': None, 'person': []},
                        'document_type': 'article',
                        'headline': {'content_kicker': None, 'kicker': '1',
                                     'main': 'Sentence for Manslaughter.',
                                     'name': None,
                                     'print_headline': 'Sentence for Manslaughter.'},
                        'keywords': [], 'news_desk': 'None',
                        'print_page': '3',
                        'pub_date': '1852-06-29T05:00:00+0000',
                        'section_name': 'Archives',
                        'source': 'The New York Times',
                        'type_of_material': 'Archives',
                        'uri': 'nyt://article/fada2905-0108-54a9-8729-ae9cda8b9528',
                        'web_url': 'https://www.nytimes.com/1852/06/29/archives/sentence-for-manslaughter.html',
                        'word_count': 0},
                ...]}
}
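To check how much metadata a given month actually carries before committing to it, you could count the populated fields. This is only a sketch under the assumption that docs look like the samples in this article; `metadata_coverage` is a hypothetical helper, and the second sample doc is a trimmed copy of the 2020 article search result shown earlier:

```python
def metadata_coverage(docs):
    """Count how many archive docs carry keywords, a byline and a word count."""
    coverage = {"total": len(docs), "keywords": 0, "byline": 0, "word_count": 0}
    for doc in docs:
        if doc.get("keywords"):
            coverage["keywords"] += 1
        # byline may be missing or hold None values, so guard both levels.
        if (doc.get("byline") or {}).get("original"):
            coverage["byline"] += 1
        if (doc.get("word_count") or 0) > 0:
            coverage["word_count"] += 1
    return coverage


docs = [
    # 1852 archive entry from the response above: almost no metadata.
    {"byline": {"organization": None, "original": None, "person": []},
     "keywords": [], "word_count": 0},
    # Trimmed 2020 entry for comparison.
    {"byline": {"original": "By Helene Cooper"},
     "keywords": [{"name": "subject", "value": "United States Politics and Government"}],
     "word_count": 927},
]

print(metadata_coverage(docs))
```

With a live request, `r.json()["response"]["docs"]` would take the place of `docs`.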

These were just some of the (in my opinion) more useful APIs provided by NY Times. Besides these, there are a bunch more available at https://developer.nytimes.com/apis. To explore each API, I would also recommend playing with the query builder, like the one for article search, which lets you build and execute your test query right on the website without any coding.

The Guardian

Next up is another great source of news and articles - The Guardian. Same as with NY Times, we first need to sign up for an API key. You can do so at https://bonobo.capi.gutools.co.uk/register/developer and you will receive your key in an email. With that out of the way, we can navigate to the API documentation and start querying the API.

Let's start simply by querying content sections of The Guardian:


# https://open-platform.theguardian.com/documentation/section
query = "science"
query_url = f"https://content.guardianapis.com/sections?" \
            f"api-key={apikey}" \
            f"&q={query}"

r = requests.get(query_url)
pprint(r.json())

{'response': {'results': [{'apiUrl': 'https://content.guardianapis.com/science',
                           'editions': [{'apiUrl': 'https://content.guardianapis.com/science',
                                         'code': 'default',
                                         'id': 'science',
                                         'webTitle': 'Science',
                                         'webUrl': 'https://www.theguardian.com/science'}],
                           'id': 'science',
                           'webTitle': 'Science',
                           'webUrl': 'https://www.theguardian.com/science'}],
              'status': 'ok',
              'total': 1,
              'userTier': 'developer'}}

These sections group content into topics, which can be useful if you are looking for a specific type of content, e.g. science or technology. If we omit the query (q) parameter, we will instead receive the full list of sections, which is about 75 records.

Moving on to something a little more interesting - searching by tags:


# https://open-platform.theguardian.com/documentation/tag
query = "weather"
section = "news"
page = "1"
query_url = f"https://content.guardianapis.com/tags?" \
            f"api-key={apikey}" \
            f"&q={query}" \
            f"&page={page}"

r = requests.get(query_url)
pprint(r.json())

{'response': {'currentPage': 1,
              'pageSize': 10,
              'pages': 139,
              'results': [
                          {'apiUrl': 'https://content.guardianapis.com/australia-news/australia-weather',
                           'id': 'australia-news/australia-weather',
                           'sectionId': 'australia-news',
                           'sectionName': 'Australia news',
                           'type': 'keyword',
                           'webTitle': 'Australia weather',
                           'webUrl': 'https://www.theguardian.com/australia-news/australia-weather'},
                          {'apiUrl': 'https://content.guardianapis.com/world/extreme-weather',
                           'id': 'world/extreme-weather',
                           'sectionId': 'world',
                           'sectionName': 'World news',
                           'type': 'keyword',
                           'webTitle': 'Extreme weather',
                           'webUrl': 'https://www.theguardian.com/world/extreme-weather'},
                          ],
              'startIndex': 1,
              'status': 'ok',
              'total': 1385,
              'userTier': 'developer'}}

This query looks quite similar to the previous one and also returns similar kinds of data. Tags also group content into categories, but there are a lot more tags (around 50000) than sections. Each of these tags has a structure like, for example, world/extreme-weather. These are very useful when searching for actual articles, which is what we will do next.
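Since only the tag IDs are needed later, it can be handy to pull them out of the response right away. A minimal sketch, assuming the response shape shown above; `tag_ids` is a hypothetical helper and `sample` is a trimmed copy of that response:

```python
def tag_ids(payload, tag_type="keyword"):
    """Return the IDs of tags of the given type from a /tags-style response."""
    results = payload.get("response", {}).get("results", [])
    return [tag["id"] for tag in results if tag.get("type") == tag_type]


# Trimmed-down sample shaped like the tags response above.
sample = {"response": {"results": [
    {"id": "australia-news/australia-weather", "type": "keyword"},
    {"id": "world/extreme-weather", "type": "keyword"},
]}}

print(tag_ids(sample))
```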

The one thing you really came here for is article search, and for that we will use the search endpoint documented at https://open-platform.theguardian.com/documentation/search:


query = "(hurricane OR storm)"
query_fields = "body"
section = "news"  # https://open-platform.theguardian.com/documentation/section
tag = "world/extreme-weather"  # https://open-platform.theguardian.com/documentation/tag
from_date = "2019-01-01"
query_url = f"https://content.guardianapis.com/search?" \
            f"api-key={apikey}" \
            f"&q={query}" \
            f"&query-fields={query_fields}" \
            f"&section={section}" \
            f"&tag={tag}" \
            f"&from-date={from_date}" \
            f"&show-fields=headline,byline,starRating,shortUrl"

r = requests.get(query_url)
pprint(r.json())

The reason I first showed you section and tag search is that those can be used in the article search. Above you can see that we used the section and tag parameters to narrow down our search; their values can be found using the previously shown queries. Apart from these parameters, we also included the obvious q parameter for our search query, as well as a starting date using from-date and the show-fields parameter, which allows us to request extra fields related to the content - in this case headline, byline, rating and shortened URL. There's a bunch more of those, with the full list available in the search documentation.

And as with all the previous ones, here is an example response:


{'response': {'currentPage': 1, 'orderBy': 'relevance', 'pageSize': 10, 'pages': 1,
              'results': [{'apiUrl': 'https://content.guardianapis.com/news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats',
                           'fields': {'byline': 'Daniel Gardner (MetDesk)',
                                      'headline': 'Weatherwatch: storms hit France and Iceland as Australia overheats',
                                      'shortUrl': 'https://gu.com/p/dv4dq'},
                           'id': 'news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats',
                           'pillarId': 'pillar/news',
                           'sectionId': 'news',
                           'type': 'article',
                           'webPublicationDate': '2019-12-19T11:33:52Z',
                           'webTitle': 'Weatherwatch: storms hit France and '
                                       'Iceland as Australia overheats',
                           'webUrl': 'https://www.theguardian.com/news/2019/dec/19/weatherwatch-storms-hit-france-and-iceland-as-australia-overheats'},
                          {'apiUrl': 'https://content.guardianapis.com/news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees',
                           'fields': {'byline': 'David Hambling',
                                      'headline': 'Weatherwatch: how repeated '
                                                  'flooding can shift levees',
                                      'shortUrl': 'https://gu.com/p/d755m'},
                           'id': 'news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees',
                           'pillarId': 'pillar/news',
                           'sectionId': 'news',
                           'type': 'article',
                           'webPublicationDate': '2020-01-31T21:30:00Z',
                           'webTitle': 'Weatherwatch: how repeated flooding can shift levees',
                           'webUrl': 'https://www.theguardian.com/news/2020/jan/31/weatherwatch-how-repeated-flooding-can-shift-levees'}],
              'startIndex': 1, 'status': 'ok', 'total': 7, 'userTier': 'developer'}}
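The response above reports currentPage and pages, so larger result sets have to be collected page by page. Below is a rough sketch of that loop, assuming the page query parameter used in the tags example also applies here. `iter_results` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for a real request so the example runs without a key:

```python
def iter_results(fetch_page, max_pages=None):
    """Yield results from every page; fetch_page(n) returns one parsed response."""
    page = 1
    while True:
        response = fetch_page(page)["response"]
        yield from response.get("results", [])
        last_page = response.get("pages", 1)
        if max_pages is not None:
            # Cap the number of requests, e.g. to stay within a daily quota.
            last_page = min(last_page, max_pages)
        if page >= last_page:
            break
        page += 1


def fake_fetch(page):
    # Stand-in for requests.get(...).json(); two pages with made-up IDs.
    pages = {1: ["a", "b"], 2: ["c"]}
    return {"response": {"currentPage": page, "pages": 2,
                         "results": [{"id": i} for i in pages[page]]}}


print([result["id"] for result in iter_results(fake_fetch)])
```

In real use, `fetch_page` would be a small function that appends `&page={page}` to the query URL and returns `requests.get(query_url).json()`.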

HackerNews

For a more tech-oriented source of news, one might turn to HackerNews, which also has a public REST API. It's documented at https://github.com/HackerNews/API. This API, as you will see, is in version v0 and is currently very bare-bones, meaning it doesn't really provide specific endpoints to - for example - query articles, comments or users.

But even though it's very basic, it still provides all that's necessary to, for example, get top stories:


query_type = "top"  # top/best/new, also ask/show/job
query_url = f"https://hacker-news.firebaseio.com/v0/{query_type}stories.json?print=pretty"  # Top Stories
r = requests.get(query_url)
ids = r.json()

top = ids[:10]
for story in top:
    query_url = f"https://hacker-news.firebaseio.com/v0/item/{story}.json?print=pretty"
    r = requests.get(query_url)
    pprint(r.json())

The snippet above is not nearly as obvious as the previous ones, so let's look at it more closely. We first send a request to the API endpoint (v0/topstories), which doesn't return the top stories as you would expect, but really just their IDs. To get the actual stories we take these IDs (the first 10 of them) and send requests to the v0/item/<ID> endpoint, which returns data for each of these individual items, each of which in this case happens to be a story.

You surely noticed that the query URL was parametrized with query_type. That's because the HackerNews API has similar endpoints for all the top sections of the website, namely ask, show, job and new.

One nice thing about this API is that it doesn't require authentication, so you don't need to request an API key and don't need to worry about rate limiting like with the other ones.

Running this code would produce a response that looks something like this:


{'by': 'rkwz',
 'descendants': 217,
 'id': 24120311,
 'kids': [24122571,
          ...,
          24121481],
 'score': 412,
 'time': 1597154451,
 'title': 'Single Page Applications using Rust',
 'type': 'story',
 'url': 'http://www.sheshbabu.com/posts/rust-wasm-yew-single-page-application/'}
{'by': 'bmgoss',
 'descendants': 5,
 'id': 24123372,
 'kids': [24123579, 24124181, 24123545, 24123929],
 'score': 55,
 'time': 1597168165,
 'title': 'Classic Books for Tech Leads (or those aspiring to be)',
 'type': 'story',
 'url': 'https://sourcelevel.io/blog/3-classic-books-for-tech-leads-or-those-aspiring-to-be'}
{'by': 'adamnemecek',
 'descendants': 7,
 'id': 24123283,
 'kids': [24123803, 24123774, 24124106, 24123609],
 'score': 69,
 'time': 1597167845,
 'title': 'Bevy: Simple, data-driven, wgpu-based game engine in Rust',
 'type': 'story',
 'url': 'https://bevyengine.org'}
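The time field in these items is a Unix timestamp, which is not very readable as is. Here is a small sketch that converts it and ranks stories by score; `story_summary` is a hypothetical helper, and the `stories` list is a trimmed copy of the items above:

```python
from datetime import datetime, timezone


def story_summary(item):
    """Reduce a story item to title, score and a readable UTC timestamp."""
    posted = datetime.fromtimestamp(item["time"], tz=timezone.utc)
    return {
        "title": item.get("title", ""),
        "score": item.get("score", 0),
        "posted": posted.isoformat(),
    }


# Trimmed-down items shaped like the sample response above.
stories = [
    {"title": "Single Page Applications using Rust", "score": 412, "time": 1597154451},
    {"title": "Classic Books for Tech Leads (or those aspiring to be)", "score": 55, "time": 1597168165},
    {"title": "Bevy: Simple, data-driven, wgpu-based game engine in Rust", "score": 69, "time": 1597167845},
]

# Highest score first.
ranked = sorted((story_summary(s) for s in stories), key=lambda s: s["score"], reverse=True)
for s in ranked:
    print(f"{s['score']:>4}  {s['posted']}  {s['title']}")
```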

If you found an interesting article and wanted to dig a little deeper, then the HackerNews API can help with that too. You can find the comments of each submission by traversing the kids field of said story. Code that would do just that looks like so:


first = 24120311  # Top story
query_url = f"https://hacker-news.firebaseio.com/v0/item/{first}.json?print=pretty"
r = requests.get(query_url)
comment_ids = r.json()["kids"]  # IDs of top level comments of first story

for i in comment_ids[:10]:  # Print first 10 comments of story
    query_url = f"https://hacker-news.firebaseio.com/v0/item/{i}.json?print=pretty"
    r = requests.get(query_url)
    pprint(r.json())

First, we look up the story (item) by ID like we did in the previous example. We then iterate over its kids and run the same query with the respective IDs, retrieving items that in this case are the story's comments. We could also go through these recursively if we wanted to build the whole tree/thread of comments for a specific story.

As always, here is a sample response:


{'by': 'Naac',
 'id': 24123455,
 'kids': [24123485],
 'parent': 24120311,
 'text': 'So as I understand it Rust is compelling because it is a safer '
         'alternative to C++ ( and sometimes C but mainly a C++ replacement '
         ').<p>We wouldn&#x27;t usually create a single page app in C++ right? '
         'So why would we want to do that in Rust ( other than, &quot;just '
         'because&quot; ). Right tool for the right job and all that.',
 'time': 1597168558,
 'type': 'comment'}
{'by': 'intelleak',
 'id': 24123860,
 'parent': 24120311,
 'text': 'I&#x27;ve been hearing good things about zig, and someone mentioned '
         'that zig has better wasm support than rust, is it true? I wish rust '
         'had a js ecosystem too ...',
 'time': 1597170320,
 'type': 'comment'}
{'by': 'praveenperera',
 'id': 24120642,
 'kids': [24120867, 24120738, 24120940, 24120721],
 'parent': 24120311,
 'text': 'Great post.<p>I&#x27;d love to see one talking about building a full '
         'stack app using Yew and Actix (or Rocket). And good ways of sharing '
         'types between the frontend and the backend.',
 'time': 1597156315,
 'type': 'comment'}
{'by': 'devxpy',
 'id': 24122583,
 'kids': [24122721, 24122756, 24122723],
 'parent': 24120311,
 'text': 'Can anyone please tell me how the author able to use html syntax in '
         'rust?<p>I get that there are macros, but how are html tags valid '
         'syntax? Is rust just interpreting the html content as '
         'strings?<p>I&#x27;ve only ever seen C macros, and I don&#x27;t '
         'remember seeing\n'
         ' this kind of wizardry happening there.',
 'time': 1597165060,
 'type': 'comment'}
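The recursive traversal mentioned earlier could look roughly like the sketch below. It assumes that comment text may contain HTML entities (decoded here with `html.unescape`) and that an item lookup can come back empty. `build_thread`, the `max_depth` limit and the in-memory `items` store with small made-up IDs are all hypothetical; `get_item` stands in for a request to v0/item/<ID>:

```python
import html


def build_thread(item_id, get_item, max_depth=3):
    """Recursively collect an item and its replies as a nested dict."""
    item = get_item(item_id)
    if item is None:
        return None
    node = {
        "id": item["id"],
        "by": item.get("by"),
        "text": html.unescape(item.get("text", "")),
        "replies": [],
    }
    # Stop descending once the depth budget is used up.
    if max_depth > 0:
        for kid_id in item.get("kids", []):
            reply = build_thread(kid_id, get_item, max_depth - 1)
            if reply is not None:
                node["replies"].append(reply)
    return node


# Tiny in-memory stand-in for the API, with made-up IDs.
items = {
    1: {"id": 1, "by": "alice", "type": "story", "kids": [2, 3]},
    2: {"id": 2, "by": "bob", "type": "comment", "text": "Great post.", "kids": [4]},
    3: {"id": 3, "by": "carol", "type": "comment", "text": "I&#x27;ve heard good things"},
    4: {"id": 4, "by": "dave", "type": "comment", "text": "Agreed."},
}

thread = build_thread(1, items.get)
print(thread["replies"][1]["text"])
```

Against the real API, `get_item` would be something like `lambda i: requests.get(f"https://hacker-news.firebaseio.com/v0/item/{i}.json").json()`. Keep in mind this sends one request per comment, so a depth limit is worth having.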

Currents

Finding a popular and good quality news API is quite difficult, as most classic newspapers don't have a free public API. There are, however, sources of aggregate news data which can be used to get articles and news from newspapers such as Financial Times and Bloomberg, which only provide paid API services, or CNN, which doesn't expose any API at all.

One of these aggregators is called Currents API. It aggregates data from thousands of sources in 18 languages and over 70 countries, and it's also free.

It's similar to the APIs shown before. We again need to get an API key first. To do so, you need to register at https://currentsapi.services/en/register. After that, go to your profile at https://currentsapi.services/en/profile and retrieve your API token.

With the key (token) ready we can request some data. There's really just one interesting endpoint, and that's https://api.currentsapi.services/v1/search:


# https://currentsapi.services/en/docs/search
apikey = os.getenv('CURRENTS_APIKEY', '...')
category = "business"
language = languages['English']  # Mapping from Language to Code, e.g.: "English": "en"
country = regions["Canada"]  # Mapping from Country to Code, e.g.: "Canada": "CA",
keywords = "bitcoin"
t = "1"  # 1 for news, 2 for article and 3 for discussion content
domain = "financialpost.com"  # website primary domain name (without www or blog prefix)
start_date = "2020-06-01T14:30"  # YYYY-MM-DDTHH:MM:SS+00:00
query_url = f"https://api.currentsapi.services/v1/search?" \
            f"apiKey={apikey}" \
            f"&language={language}" \
            f"&category={category}" \
            f"&country={country}" \
            f"&type={t}" \
            f"&domain={domain}" \
            f"&keywords={keywords}" \
            f"&start_date={start_date}"

r = requests.get(query_url)
pprint(r.json())

This endpoint includes lots of filtering options including language, category, country and more, as shown in the snippet above. All of those are pretty self-explanatory, but for the first three I mentioned you will need some extra information, as their possible values aren't really obvious. These values come from separate API endpoints: for languages and regions they are really just mappings of name to code (e.g. "English": "en"), and for categories just a list of possible values. They are omitted in the code above to make it a bit shorter, but I just copied these mappings into Python dicts to avoid calling the API every time.

The response to the above request looks like this:


{'news': [{'author': 'Bloomberg News',
           'category': ['business'],
           'description': '(Bloomberg) — Bitcoin is notoriously volatile, prone to sudden price surges and swift reversals '
                          'that can wipe out millions of dollars of value in a matter of minutes. Those changes are often...',
           'id': 'cb50963e-73d6-4a21-bb76-ec8bc8b9c201',
           'image': 'https://financialpostcom.files.wordpress.com/2017/11/fp-512x512.png',
           'language': 'ru',
           'published': '2020-04-25 05:02:50 +0000',
           'title': 'Get Set for Bitcoin ‘Halving’! Here’s What That Means',
           'url': 'https://business.financialpost.com/pmn/business-pmn/get-set-for-bitcoin-halving-heres-what-that-means'},
          {'author': 'Reuters',
           'category': ['business'],
           'description': 'NEW YORK — Crushing asset sell-offs ranging from bitcoin to precious metals and European stocks '
                          'accompanied Wall Street’s slide into bear market territory on Thursday, as investors liqu…',
           'id': '3c75b090-ec7d-423e-9487-85becd92d10c',
           'image': 'https://financialpostcom.files.wordpress.com/2017/11/fp-512x512.png',
           'language': 'en',
           'published': '2020-03-12 23:14:18 +0000',
           'title': 'Wall Street sell-off batters bitcoin, pounds palladium as '
                    'investors go to cash',
           'url': 'https://business.financialpost.com/pmn/business-pmn/wall-street-sell-off-batters-bitcoin-pounds-palladium-as-investors-go-to-cash'}],
 'page': 1,
 'status': 'ok'}
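Notice that the sample above contains an item tagged 'ru' and dates earlier than the requested start_date, so it may be worth re-checking results on the client side. A minimal sketch, assuming the published format shown above; `parse_published` and `filter_news` are hypothetical helpers and `sample` is a trimmed copy of that response:

```python
from datetime import datetime, timezone


def parse_published(value):
    """Parse timestamps like '2020-04-25 05:02:50 +0000' into aware datetimes."""
    return datetime.strptime(value, "%Y-%m-%d %H:%M:%S %z")


def filter_news(payload, language=None, since=None):
    """Keep only items matching the language code and published on or after `since`."""
    kept = []
    for item in payload.get("news", []):
        if language and item.get("language") != language:
            continue
        if since and parse_published(item["published"]) < since:
            continue
        kept.append(item)
    return kept


# Trimmed-down sample shaped like the response above.
sample = {"news": [
    {"id": "cb50963e-73d6-4a21-bb76-ec8bc8b9c201", "language": "ru",
     "published": "2020-04-25 05:02:50 +0000"},
    {"id": "3c75b090-ec7d-423e-9487-85becd92d10c", "language": "en",
     "published": "2020-03-12 23:14:18 +0000"},
]}

english = filter_news(sample, language="en")
recent = filter_news(sample, since=datetime(2020, 4, 1, tzinfo=timezone.utc))
print(len(english), len(recent))
```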

If you aren't searching for a specific topic or historical data, then there's one other option which Currents API provides - the latest news endpoint:


language = languages['English']
query_url = f"https://api.currentsapi.services/v1/latest-news?" \
            f"apiKey={apikey}" \
            f"&language={language}"

r = requests.get(query_url)
pprint(r.json())

It is very similar to the search endpoint; this one, however, only provides the language parameter and produces results like this:


{'news': [{'author': 'Isaac Chotiner',
           'category': ['funny'],
           'description': 'The former U.S. Poet Laureate discusses her decision to tell her mother\'s story in prose, in '
                          'her new book, "Memorial Drive," and her feelings about the destruction of Confederate monuments...',
           'id': '3ded3ed1-ecb8-41db-96d3-dc284f4a61de',
           'image': 'https://media.newyorker.com/photos/5f330eba567fa2363b1a19c3/16:9/w_1280,c_limit/Chotiner-NatashaTrethewey.jpg',
           'language': 'en',
           'published': '2020-08-12 19:15:03 +0000',
           'title': 'How Natasha Trethewey Remembers Her Mother',
           'url': 'https://www.newyorker.com/culture/q-and-a/how-natasha-trethewey-remembers-her-mother'},
          {'author': '@BBCNews',
           'category': ['regional'],
           'description': 'Firefighters are tackling the blaze that broke out in the engineering department at the university...',
           'id': '9e1f1ee2-8041-4864-8cca-0ffaedf9ae2b',
           'image': 'https://ichef.bbci.co.uk/images/ic/1024x576/p08ngy6g.jpg',
           'language': 'en',
           'published': '2020-08-12 18:37:48 +0000',
           'title': "Fire at Swansea University's Bay campus",
           'url': 'https://www.bbc.co.uk/news/uk-wales-53759352'}],
 'page': 1,
 'status': 'ok'}

Conclusion

There are many great news sites and online newspapers out there on the internet, but in most cases you won't be able to scrape their data or access them programmatically. The ones shown in this article are the rare few with a nice API and free access that you can use for your next project, whether it's some data science, machine learning or a simple news aggregator. If you don't mind paying some money for a news API, you might also consider using Financial Times or Bloomberg. Apart from APIs, you can also try scraping HTML and parsing the content yourself with something like BeautifulSoup. If you happen to find any other good source of news data, please let me know, so that I can add it to this list. 🙂

I'm currently looking for a new role. If you're hiring, feel free to reach out at martin7.heinz@gmail.com or on LinkedIn.
