Motivation

Last week I detailed how and why nflgame broke. Now I’ll be going over how I intend to fix it.

Intended Audience

  • Anyone using nflgame who is interested in contributing.
  • Anyone interested in legally accessing nfl.com for its data.
  • Anyone interested in a pragmatic async Python example.

Objectives

  1. Scaffold a project for scraping nfl.com data
  2. Scrape the schedule data for the 2020 season
  3. Save it in a JSON format nflgame can consume

Legality

As long as we use public information for non-commercial purposes, we stay in line with their terms. It's possible this data could even be used commercially if hiQ v. LinkedIn can be taken as precedent, but I'm unsure myself, so I've abandoned all commercial plans for nflgame.

NFL.com terms

You may use the Services and the contents contained in the Services solely for your own individual non-commercial and informational purposes only.

We are allowed to use the data presented for personal, non-commercial use.


Dave Chappelle's avatar would still be black

#12 Prohibited Use

f)use or attempt to use any engine, software, tool, agent or other device or mechanism (including, without limitation, browsers, spiders, robots, avatars or intelligent agents) to navigate or search the Services to harvest or otherwise collect information from the Services to be used for any commercial purpose;

We can even use spiders/robots (or whatever the hell an avatar is), as long as it's not for a commercial purpose.


Security Violations: c) attempting to interfere with service to any user, host or network, including, without limitation, by way of submitting a virus to, or overloading, “flooding”, “spamming”, “mail bombing” or “crashing”, the Services;

We can't DDoS nfl.com.


e) forging any TCP/IP packet header or any part of the header information in any e-mail or posting. Violations of system or network security may result in civil or criminal liability.

We may investigate occurrences that may involve violations of the security of the Services or of the law and we may involve, and cooperate with, law enforcement authorities in prosecuting users who are involved in such violations.

Don't touch the headers, boys and girls.


Getting Started

The most important part of this code is how many requests we make per second. We want to scrape this data as fast as we can, but we shouldn't abuse nfl.com. In fact, this is the only hard requirement of this effort: DON'T ABUSE NFL.COM. Re-seeding the data in nflgame takes thousands of requests, and with async Python it's possible to fire them all off in seconds… which is a very easy way to get blacklisted.

So with that in mind, let's get into ways we can throttle our requests, responsibly scrape the data, and remain ToS friendly.
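
Concretely, "throttling" here just means capping how many requests are in flight and pacing them over time. Here's a minimal sketch of the idea in plain asyncio (the helper names and limits are made up for illustration; later we lean on asyncio-throttle instead):

import asyncio

async def throttled_fetch(url: str, semaphore: asyncio.Semaphore) -> str:
    """Illustration only: at most N requests run at once, each followed by a short pause."""
    async with semaphore:
        # a real implementation would perform the HTTP GET here
        await asyncio.sleep(1)  # crude pacing so we never hammer nfl.com
        return url

async def fetch_all(urls):
    semaphore = asyncio.Semaphore(3)  # cap concurrency at 3 in-flight requests
    return await asyncio.gather(*(throttled_fetch(u, semaphore) for u in urls))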

The Tools

Python 3.7+. Asynchronous Python was introduced in Python 3.4, but it has evolved quite a bit since, so going forward version 3.7+ is going to be required for nflgame. Async is the ideal way for us to responsibly, reliably, and quickly make concurrent requests to nfl.com.

We'll also need several third-party dependencies:

Setup

Make sure to pip install -r requirements.txt

aiohttp
aiofiles
beautifulsoup4
pytest
asyncio-throttle

All the imports you'll be using

import json
import sys
import logging
import time
import asyncio
import os

from bs4 import BeautifulSoup

import aiofiles
import aiohttp
from aiohttp import ClientSession
from asyncio_throttle import Throttler

Basic Logging

A crucial aspect of writing async code, in general, is logging. Logging gives us a really good sense of what's happening when.

import logging

logging.basicConfig(
    format="%(asctime)s %(levelname)s:%(name)s: %(message)s",
    level=logging.DEBUG,
    datefmt="%H:%M:%S",
    stream=sys.stderr,
)
logger = logging.getLogger("nfl-scrape-async")
logging.getLogger("chardet.charsetprober").disabled = True

This is invoked with logger.info("your log message"). You can check out the Python docs for logging for more information on how to use the Python logger.
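
For example (the message text here is just a placeholder), a call and the line it produces with the format configured above look like this:

logger.info("Starting schedule scrape")
# 15:39:15 INFO:nfl-scrape-async: Starting schedule scrape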

Building a list of urls

There is no easy way to navigate the NFL's schedules by following hrefs. The navigation component is a drop-down menu and contains no references to the various weeks. I'm not willing to commit to something like Selenium quite yet, so let's build the URLs ourselves!

There are historically:

  • 5 weeks of preseason (starting @ 0)
  • 17 weeks of regular season (starting @ 1)
  • 4 weeks of postseason (starting @ 1 + ignoring the Pro Bowl)

SEASON_PHASES = (
    ('PRE', range(0, 4 + 1)),
    ('REG', range(1, 17 + 1)),
    ('POST', range(1, 4 + 1)),
)

String template for the URL(s) we will be probing

URL_STR = "https://www.nfl.com/schedules/{year}/{phase}{week_num}"

A function that takes a year and spits out a list of formatted URLs

def build_urls(year):
    urls = []
    for phase, weeks in SEASON_PHASES:
        for week_num in weeks:
            url = URL_STR.format(year=year, phase=phase, week_num=week_num)
            logger.info("Adding {} to urls".format(url))
            urls.append(url)
    return urls

Running python -c "from ScheduleSpider import build_urls; build_urls(2020)" will print…

15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE0 to urls
15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE1 to urls
15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE2 to urls

.... etc

Scaffolding The Async Code

I'd be dishonest if I told you this next part was easy for me. I attempted to build some kind of rate-limiting component myself, but quite frankly I'm still learning the patterns for leveraging async code in Python. I ended up just using asyncio-throttle because its demo code worked out of the box.

async def main():
    """Main program for setting up task and throttler
    """
    throttler = Throttler(rate_limit=3, period=3)
    async with ClientSession() as session:
        urls = build_urls(2020)
        # While this is only one task now, we can later add more tasks (fetchGameData() or w/e).
        tasks = [asyncio.create_task(worker(throttler, session, urls))]
        await asyncio.wait(tasks)

async def worker(throttler, session, urls):
    logger.info("Worker fetching {} urls".format(len(urls)))
    for url in urls:
        async with throttler:
            print(url)

# this guard allows us to import this spider and drive it with your own loop
if __name__ == "__main__":
    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()

After executing python ScheduleSpider.py you should see the throttler at work.

Async-Throttle Demo

Making asynchronous http requests

Now that we aren’t at risk of flooding nfl.com with async requests… we can actually make some calls.

async def fetch_html(url: str, session: ClientSession, **kwargs) -> str:
    """GET request wrapper to fetch page HTML.

    kwargs are passed to `session.request()`.
    """
    resp = await session.request(method="GET", url=url, **kwargs)
    resp.raise_for_status()
    logger.info("Got response [%s] for URL: %s", resp.status, url)
    html = await resp.text()
    return html

Modify the worker to call fetch_html

async def worker(throttler, session, urls):
    data = list()
    logger.info("Worker fetching {} urls".format(len(urls)))
    for url in urls:
        async with throttler:
            data.append(await fetch_html(url, session))

Now running ScheduleSpider.py will ping nfl.com in bursts of at most 3 requests per 3-second window.

Aiohttp Throttled Requests

Parsing the nfl schedule pages

Now for a parsing method. In the future we will be refactoring functions like parse, fetch_html, etc. into their own class, which ScheduleSpider will invoke. For now, it will only parse games from the schedule pages.

async def parse(url: str, session: ClientSession, **kwargs) -> list:
    games = []
    # Try to make a request
    try:
        html = await fetch_html(url=url, session=session, **kwargs)
    except (
        aiohttp.ClientError,
        aiohttp.http_exceptions.HttpProcessingError,
    ) as e:
        # Log the aiohttp exception details thrown by fetch_html
        logger.error(
            "aiohttp exception for %s [%s]: %s",
            url,
            getattr(e, "status", None),
            getattr(e, "message", None),
        )
        return games
    except Exception as e:
        # Log all non aiohttp exceptions
        logger.exception(
            "Non-aiohttp exception occured:  %s", getattr(e, "__dict__", {})
        )
        return games
    else:
        # Instantiate soup
        page_soup = BeautifulSoup(html, 'html.parser')

        # Games are grouped by day...
        matchup_groups = page_soup.select("section.nfl-o-matchup-group")

        for group in matchup_groups:
            # Grab the date string for each group
            datestr = group.find("h2", class_="d3-o-section-title").get_text()

            # Each group of games has a container...
            game_strips = group.select("div.nfl-c-matchup-strip")
            for strip in game_strips:
                # Format the data...
                team_abbv = strip.select("span.nfl-c-matchup-strip__team-abbreviation")
                g = {
                    "date": datestr,
                    "time": strip.select("span.nfl-c-matchup-strip__date-time")[0].get_text().strip(),
                    "away": team_abbv[0].get_text().strip(),
                    "home": team_abbv[1].get_text().strip(),
                }
                logger.info("Game Details:")
                logger.info(g)
                logger.info("------------")
                games.append(g)

        logger.info("Found %d links for %s", len(games), url)
        return games

Now swap out the fetch_html() call with parse()

... rest of worker code...
data += await parse(url, session)

Now python ScheduleSpider.py will print out the details for every game in the list as they are received.
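
For reference, each parsed game comes out as a small dict along these lines (the values are illustrative; the exact date and time strings depend on nfl.com's markup):

{
    "date": "Thursday, September 10th",  # from the matchup group's section title
    "time": "8:20 PM",                   # kickoff time as displayed on the page
    "away": "HOU",
    "home": "KC",
}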

NFL data scraping

Saving schedule data

File I/O is blocking. Eventually we will be writing multiple files, so let's take a moment to implement non-blocking file I/O with aiofiles.

async def write_results(data: dict) -> None:
    """ Writes data to schedule.json
    """
    outfile = os.path.join(os.path.dirname(__file__), 'schedule.json')
    json_str = json.dumps(data)
    async with aiofiles.open(outfile, "w+") as f:
        await f.write(json_str)
        logger.info("Wrote results for source URLs")

Update 9/9/2021

I've since abandoned all efforts of maintaining an easily accessible NFL API/dataset. If you are interested in taking this code forward… check out the next steps!

Next Steps for this code

I made several compromises here that will need to be addressed before nflgame can make use of this:

1) Multiple Data Streams

We want to implement a way to pull roster and play-by-play data. This will necessitate some refactoring in terms of how and when we execute async calls. Play-by-play data, for example, depends on the schedule: we will only poll those pages while a game is active, and while it is active we will want to update the data in short increments (exact timing is TBD).
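
As a rough illustration of what that might look like (purely a sketch: is_active, fetch_play_by_play, and the 30-second interval are hypothetical placeholders, not part of the current code):

def is_active(game) -> bool:
    """Hypothetical predicate: compare kickoff time to now, check a status field, etc."""
    raise NotImplementedError

async def fetch_play_by_play(game, session):
    """Hypothetical coroutine that would fetch and parse one game's play-by-play page."""
    raise NotImplementedError

async def poll_live_games(throttler, session, schedule):
    """Re-poll play-by-play pages only while games are active."""
    while True:
        active = [g for g in schedule if is_active(g)]
        if not active:
            break
        for game in active:
            async with throttler:
                await fetch_play_by_play(game, session)
        await asyncio.sleep(30)  # exact polling interval is TBD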

2) Missing IDs

If you are an advanced nflgame user you will notice that this does not produce a complete schedule that nflgame can understand. The schedule does not have any eid - the key nflgame uses to identify games. This points to a fundamental problem with how we are gathering this data: how do we obtain the unique identifiers nflgame uses for plays, players, and games?

3) To mutate or not

Currently the entire schedule is replaced every time. I am unsure if I should be mutating these data sets or not. I’ll probably just treat this data as immutable, but who knows. A problem for future me.