Motivation
Last week I detailed how and why nflgame broke. Now I’ll be going over how I intend to fix it.
Intended Audience
- Anyone using nflgame that is interested in contributing.
- Anyone interested in legally accessing nfl.com for its data
- Anyone interested in a pragmatic async Python example
Objectives
- Scaffold project for scraping nfl.com data
- Scrape schedule data for 2020 season
- Save it in a JSON format nflgame can consume
Legality
We are allowed to use any public information for non-commercial purposes and still be in line with their terms. It's possible this data is also available for commercial use if hiQ v. LinkedIn can be taken as precedent, but I'm unsure myself, so I've abandoned all commercial plans for nflgame.
NFL.com terms
#1 Copyright
You may use the Services and the contents contained in the Services solely for your own individual non-commercial and informational purposes only.
We are allowed to use the data presented for personal, non-commercial use.
#12 Prohibited Use
f)use or attempt to use any engine, software, tool, agent or other device or mechanism (including, without limitation, browsers, spiders, robots, avatars or intelligent agents) to navigate or search the Services to harvest or otherwise collect information from the Services to be used for any commercial purpose;
We can even use spiders/robots or whatever the hell an avatar is, as long as the data isn't collected for a commercial purpose.
Security Violations: c) attempting to interfere with service to any user, host or network, including, without limitation, by way of submitting a virus to, or overloading, “flooding”, “spamming”, “mail bombing” or “crashing”, the Services;
We can’t DDoS nfl.com
e) forging any TCP/IP packet header or any part of the header information in any e-mail or posting. Violations of system or network security may result in civil or criminal liability.
We may investigate occurrences that may involve violations of the security of the Services or of the law and we may involve, and cooperate with, law enforcement authorities in prosecuting users who are involved in such violations.
Don't touch the headers, boys and girls.
Getting Started
The most important consideration in this code is how many requests we make per second. We shouldn't abuse nfl.com, but we do want to scrape this data as fast as we reasonably can. In fact, this is the only hard requirement of this effort: DON'T ABUSE NFL.COM. Re-seeding the data in nflgame requires thousands of requests, and with async Python it's possible to fire them all off in seconds... which is a very easy way to get blacklisted.
That being said, let's get into ways we can throttle our requests, responsibly scrape the data, and remain ToS friendly.
The Tools
Python 3.7+. Asynchronous Python was introduced in Python 3.4, but it has evolved quite a bit since, so going forward version 3.7+ will be required for nflgame. Async is the ideal way for us to responsibly, reliably and quickly make concurrent requests to nfl.com.
We'll also need several third-party dependencies:
- aiohttp - an async request framework.
- asyncio-throttle - Simple async throttle
- aiofiles - async file access
- beautifulsoup4 - parse html
Setup
Make sure to pip install -r requirements.txt
aiohttp
aiofiles
beautifulsoup4
pytest
asyncio-throttle
**All the imports you'll be using**
import json
import sys
import logging
import time
import asyncio
import os
from bs4 import BeautifulSoup
import aiofiles
import aiohttp
from aiohttp import ClientSession
from asyncio_throttle import Throttler
Basic Logging
A crucial aspect of writing async code, in general, is logging. Logging gives us a really good sense of what's happening and when.
import logging
logging.basicConfig(
format="%(asctime)s %(levelname)s:%(name)s: %(message)s",
level=logging.DEBUG,
datefmt="%H:%M:%S",
stream=sys.stderr,
)
logger = logging.getLogger("nfl-scrape-async")
logging.getLogger("chardet.charsetprober").disabled = True
This is invoked with logger.info("your log message"). You can check out the Python docs for logging for more information on how to use the Python logger.
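For example, building on the logger configured above, a couple of calls and the kind of output they produce (timestamps here are just illustrative):
# Messages go to stderr in the "%(asctime)s %(levelname)s:%(name)s: %(message)s" format
logger.info("Starting schedule scrape")
logger.debug("Built %d urls", 26)
# e.g. 15:39:15 INFO:nfl-scrape-async: Starting schedule scrape
# e.g. 15:39:15 DEBUG:nfl-scrape-async: Built 26 urls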
Building a list of urls
There is no easy way to navigate the NFL's schedules by following hrefs. The navigation component is a drop-down menu and contains no references to the various weeks. I'm not willing to commit to something like Selenium quite yet, so let's build the urls!
Historically there are:
- 5 weeks of pre season (starting @ 0)
- 17 weeks of regular season (starting @ 1)
- 4 weeks of post season (starting @ 1, ignoring the Pro Bowl)
SEASON_PHASES = (
    ('PRE', range(0, 4 + 1)),
    ('REG', range(1, 17 + 1)),
    ('POST', range(1, 4 + 1)),
)
String template for the url(s) we will be probing
URL_STR = "https://www.nfl.com/schedules/{year}/{phase}{week_num}"
Function that takes a year and spits out a list of formatted urls
def build_urls(year):
    urls = []
    for phase, weeks in SEASON_PHASES:
        for week_num in weeks:
            url = URL_STR.format(year=year, phase=phase, week_num=week_num)
            logger.info("Adding {} to urls".format(url))
            urls.append(url)
    return urls
Running python -c "from ScheduleSpider import build_urls; build_urls(2020)" will print…
15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE0 to urls
15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE1 to urls
15:39:15 INFO:nfl-scrape-async: Adding https://www.nfl.com/schedules/2020/PRE2 to urls
.... etc
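Since pytest is already in the requirements, a quick sanity check for build_urls is cheap to add. The sketch below is my own (the test file name is hypothetical); it assumes the module is named ScheduleSpider.py as in the command above, and the expected count comes straight from the SEASON_PHASES tuple (5 + 17 + 4 weeks).
# test_schedule_spider.py -- hypothetical file name, run with `pytest`
from ScheduleSpider import build_urls

def test_build_urls_2020():
    urls = build_urls(2020)
    # 5 PRE + 17 REG + 4 POST weeks, per SEASON_PHASES above
    assert len(urls) == 5 + 17 + 4
    assert urls[0] == "https://www.nfl.com/schedules/2020/PRE0"
    assert urls[-1] == "https://www.nfl.com/schedules/2020/POST4"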
Scaffolding The Async Code
I'd be dishonest if I told you this next part was easy for me. I attempted to build some kind of rate limiting component myself, but quite frankly I'm still learning the patterns for leveraging async code in Python. I ended up just using asyncio-throttle because its demo code worked out of the box.
async def main():
    """Main program for setting up tasks and the throttler."""
    throttler = Throttler(rate_limit=3, period=3)
    async with ClientSession() as session:
        urls = build_urls(2020)
        # While this is only one task now, we can later add more tasks (fetchGameData() or w/e).
        tasks = [loop.create_task(worker(throttler, session, urls))]
        await asyncio.wait(tasks)

async def worker(throttler, session, urls):
    logger.info("Worker fetching {} urls".format(len(urls)))
    for url in urls:
        async with throttler:
            print(url)

# this guard allows us to import this spider and use our own loop
if __name__ == "__main__":
    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main())
    loop.close()
After executing python ScheduleSpider.py you should see the throttler at work.
Making asynchronous http requests
Now that we aren’t at risk of flooding nfl.com with async requests… we can actually make some calls.
async def fetch_html(url: str, session: ClientSession, **kwargs) -> str:
    """GET request wrapper to fetch page HTML.

    kwargs are passed to `session.request()`.
    """
    resp = await session.request(method="GET", url=url, **kwargs)
    resp.raise_for_status()
    logger.info("Got response [%s] for URL: %s", resp.status, url)
    html = await resp.text()
    return html
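Because the kwargs are forwarded straight to session.request(), any aiohttp request option can be passed through. Here's a quick sketch of what that might look like; the function name, URL and timeout value are purely illustrative, and it assumes the fetch_html and logger defined above are in scope.
async def fetch_one_example(session: ClientSession) -> None:
    # Illustrative only: pass a per-request timeout through to session.request()
    html = await fetch_html(
        url="https://www.nfl.com/schedules/2020/REG1",
        session=session,
        timeout=aiohttp.ClientTimeout(total=10),  # give up after 10 seconds
    )
    logger.info("Fetched %d characters of HTML", len(html))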
Modify the worker to call fetch_html
async def worker(throttler, session, urls):
    data = list()
    logger.info("Worker fetching {} urls".format(len(urls)))
    for url in urls:
        async with throttler:
            # collect each page's HTML for now
            data.append(await fetch_html(url, session))
Now running ScheduleSpider.py will ping nfl.com at a rate of no more than 3 requests every 3 seconds.
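If that pacing seems magical, here is a tiny throwaway script (not part of the spider) that demonstrates the Throttler's behavior: at most rate_limit entries into the async with block per period seconds.
import asyncio
import time

from asyncio_throttle import Throttler

async def throttle_demo():
    throttler = Throttler(rate_limit=3, period=3)
    start = time.monotonic()
    for i in range(9):
        async with throttler:
            # requests 0-2 fire immediately, 3-5 after ~3s, 6-8 after ~6s
            print("request {} at {:.1f}s".format(i, time.monotonic() - start))

asyncio.run(throttle_demo())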
Parsing the nfl schedule pages
Now for a parsing method. In the future we will refactor functions like parse, fetch_html, etc. into their own class, and ScheduleSpider will invoke them. For now, it will only parse games from the schedule pages.
async def parse(url: str, session: ClientSession, **kwargs) -> list:
    games = []
    # Try to make a request
    try:
        html = await fetch_html(url=url, session=session, **kwargs)
    except (
        aiohttp.ClientError,
        aiohttp.http_exceptions.HttpProcessingError,
    ) as e:
        # Log the aiohttp exception details thrown by fetch_html
        logger.error(
            "aiohttp exception for %s [%s]: %s",
            url,
            getattr(e, "status", None),
            getattr(e, "message", None),
        )
        return games
    except Exception as e:
        # Log all non-aiohttp exceptions
        logger.exception(
            "Non-aiohttp exception occurred: %s", getattr(e, "__dict__", {})
        )
        return games
    else:
        # Instantiate soup
        page_soup = BeautifulSoup(html, 'html.parser')
        # Games are grouped by day...
        matchup_groups = page_soup.select("section.nfl-o-matchup-group")
        for group in matchup_groups:
            # Grab the date string for each group
            datestr = group.find("h2", class_="d3-o-section-title").get_text()
            # Each group of games has a container...
            game_strips = group.select("div.nfl-c-matchup-strip")
            for strip in game_strips:
                # Format the data...
                team_abbv = strip.select("span.nfl-c-matchup-strip__team-abbreviation")
                g = {
                    "date": datestr,
                    "time": strip.select("span.nfl-c-matchup-strip__date-time")[0].get_text().strip(),
                    "away": team_abbv[0].get_text().strip(),
                    "home": team_abbv[1].get_text().strip(),
                }
                logger.info("Game Details:")
                logger.info(g)
                logger.info("------------")
                games += [g]
        logger.info("Found %d games for %s", len(games), url)
        return games
Now swap out the fetch_html() call in the worker with parse()
            # ... rest of worker code ...
            data += await parse(url, session)
Now python ScheduleSpider.py will print out the details for every game in the list as they are received.
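Because parse() depends entirely on the page markup, it can be useful to exercise the same CSS selectors against a tiny hand-written fixture before pointing the spider at nfl.com. The HTML below is hypothetical, hand-rolled to match the class names used above rather than copied from a real nfl.com page, and the test runs with pytest.
from bs4 import BeautifulSoup

FIXTURE = """
<section class="nfl-o-matchup-group">
  <h2 class="d3-o-section-title">Sunday, September 13th</h2>
  <div class="nfl-c-matchup-strip">
    <span class="nfl-c-matchup-strip__date-time">1:00 PM</span>
    <span class="nfl-c-matchup-strip__team-abbreviation">NYJ</span>
    <span class="nfl-c-matchup-strip__team-abbreviation">BUF</span>
  </div>
</section>
"""

def test_matchup_selectors():
    group = BeautifulSoup(FIXTURE, "html.parser").select("section.nfl-o-matchup-group")[0]
    strip = group.select("div.nfl-c-matchup-strip")[0]
    abbvs = strip.select("span.nfl-c-matchup-strip__team-abbreviation")
    assert group.find("h2", class_="d3-o-section-title").get_text() == "Sunday, September 13th"
    assert strip.select("span.nfl-c-matchup-strip__date-time")[0].get_text().strip() == "1:00 PM"
    assert [a.get_text().strip() for a in abbvs] == ["NYJ", "BUF"]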
Saving schedule data
File IO is blocking. Eventually we will be writing multiple files, so let's take a moment to implement non-blocking file IO with aiofiles.
async def write_results(data: dict) -> None:
    """Writes data to schedule.json"""
    outfile = os.path.join(os.path.dirname(__file__), 'schedule.json')
    json_str = json.dumps(data)
    async with aiofiles.open(outfile, "w+") as f:
        await f.write(json_str)
    logger.info("Wrote results for source URLs")
Update 9/9/2021
I've since abandoned all efforts to maintain an easily accessible NFL API/dataset. If you are interested in taking this code forward… check out the next steps!
Next Steps for this code
I made several compromises here that will need to be addressed before nflgame can make use of this:
1) Multiple Data Streams
We want to implement a way to pull roster and play-by-play data. This will necessitate some refactoring in terms of how and when we execute async calls. Play-by-play data, for example, depends on the schedule in that we will only poll those pages while a game is active, and while it is active we will want to update the data in short increments (exact timing is TBD).
2) Missing IDs
If you are an advanced nflgame user you will notice that this does not produce a complete schedule that nflgame can understand. The schedule does not have any eid, the key nflgame uses to identify games. This is likely a massive, fundamental problem with how we are gathering this data: namely, how do we obtain the unique identifiers nflgame uses for plays, players and games?
3) To mutate or not
Currently the entire schedule is replaced every time. I am unsure if I should be mutating these data sets or not. I’ll probably just treat this data as immutable, but who knows. A problem for future me.