Announcing a Python CLI package that makes it easier to download individual 3GPP standards documents, or to bulk download whole sets of documents as needed.
Periodically I have to use the 3GPP site to access the standards documents published there. The site is well structured and the standards are freely available (yay!), but I’ve consistently found that navigating from the front page to the download link for the document you want can take several false turns and many clicks. More recently I also found myself wanting to download “all the radio specifications”, and the only way to do that through the website is to manually click through every download link; tedious at best.
One solution is to somehow record all the download URLs and use wget to download the documents. Instead I decided to apply a little knowledge of the download site structure and implement a utility that makes the downloads easier.
Announcing download_3gpp. It’s available from PyPI and gives you a download_3gpp command that allows you to filter which documents get downloaded. By default (no arguments) it downloads the latest copies of all standards, across all releases and all series.
Here’s the output of the command line help to get you started:
```
> download_3gpp --help
usage: download_3gpp [-h] [--base-url BASE_URL] [--destination DESTINATION]
                     [--rel REL] [--series SERIES] [--std STD]

Acquire 3GPP standards packages from archive

optional arguments:
  -h, --help            show this help message and exit
  --base-url BASE_URL   Base 3GPP download URL to target, default
                        "https://www.3gpp.org/ftp/Specs/latest/"
  --destination DESTINATION
                        Destination download directory, default "./"
  --rel REL             3GPP release number to target, default "all"
  --series SERIES       3GPP series number to target, default "all"
  --std STD             3GPP standard number to target, default "all"
```
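So, for example, grabbing only the 36 series documents from Release 15 into a local specs directory would look something like the following; the release, series and destination values here are just illustrative inputs to the documented options:

```
> download_3gpp --rel 15 --series 36 --destination ./specs/
```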
So how does it work?
If you’re interested in what makes the utility work I’ll explain that a little here. I’ll discuss the CI pipeline and release management for the project in another post.
The 3GPP download site has a basic structure that looks something like this, starting from the “latest” folder:
```
https://www.3gpp.org/ftp/Specs/latest/
├── readme.txt
└── <dir> Rel-<R>
    └── <dir> <S>_series
        ├── <S><nnn>-*.zip
        .
        .
        .
        └── <S><...>-*.zip
```
In this case <R> is the 3GPP “Rel” number, <S> is the 3GPP “Series” number and <nnn> is the 3GPP “Std” number. The series folder contains one or more standards documents (the zip files), depending on how many are defined for that series; this is where the main 3GPP site itself is helpful for explaining what the various series and standards are.
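To make that concrete, here is a small sketch of how release and series values map onto the layout above. TS 36.331 (series 36, standard 331) is used purely as an illustrative example and isn’t special to the utility, which discovers these URLs by scraping rather than constructing them directly.

```python
from urllib.parse import urljoin

# Hypothetical illustration of how release/series values map onto the site
# layout described above.
base_url = "https://www.3gpp.org/ftp/Specs/latest/"
rel, series = 16, 36

series_index = urljoin(base_url, "Rel-{0}/{1}_series/".format(rel, series))
# -> https://www.3gpp.org/ftp/Specs/latest/Rel-16/36_series/
# The zip files listed under that index follow the "<S><nnn>-*.zip" pattern,
# e.g. 36331-*.zip for standard 331 of the 36 series.
```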
There are two libraries fundamental to the operation of the utility: requests and BeautifulSoup. The requests library is responsible for acquiring the HTML page content, and BeautifulSoup provides a clean way to access the content of each page, in particular the navigation and download URLs that this utility is primarily interested in.
The Downloader class and its get_files method are the key entry point for acquiring the files specified by the user. I’ve excerpted the code below to highlight the steps involved, with some additional comments.
```python
class Downloader:
    def __init__(self, user_options: UserOptions):
        self.__user_options = user_options

    def get_files(self):
        rel_url_data = get_rel_urls(
            self.__user_options.base_url, self.__user_options.rel
        )
        if not rel_url_data:
            # Issue a warning to the user that the "rel" data is empty. This is
            # not necessarily an error since if the user has changed the base
            # URL it is possible for some of those 3GPP snapshot folders to not
            # have any data in them.
            log.warning(...)

        # Iterate over the release ("rel" in 3GPP parlance) subdirectories
        for rel_basename, rel_number, rel_url in rel_url_data:
            this_series_url_data = get_series_urls(rel_url, self.__user_options.series)

            if not this_series_url_data:
                # Issue a warning to the user that the series URL data is empty.
                log.warning(...)

            for series_basename, series_number, series_url in this_series_url_data:
                std_url_data = get_std_urls(
                    series_url, series_number, self.__user_options.std
                )

                if not std_url_data:
                    # Issue a warning to the user that the std URL data is empty.
                    log.warning(...)

                for std_file, std_url in std_url_data:
                    # Formulate the local file path using `os.path.join`
                    local_std_path = ...

                    # Ensure the directory for the file exists, and all
                    # intermediate directories (similar to `mkdir -p`).
                    os.makedirs(os.path.dirname(local_std_path), exist_ok=True)

                    if not os.path.isfile(local_std_path):
                        # Save the remote file content into a local file.
                        log.info("Downloading file, {0}".format(std_url))
                        r = requests.get(std_url)
                        with open(local_std_path, "wb") as this_file:
                            this_file.write(r.content)
                    else:
                        # Warn the user that the file already exists locally
                        # and won't be downloaded again.
                        log.warning(...)
```
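Using the downloader from Python then amounts to something like the sketch below. The UserOptions fields shown here simply mirror the command line arguments, and the import path is my assumption; check the package source for the real module layout.

```python
# A minimal sketch, assuming UserOptions is a plain container whose fields
# mirror the command line arguments; the actual constructor may differ.
from download_3gpp.download import Downloader, UserOptions  # import path assumed

options = UserOptions(
    base_url="https://www.3gpp.org/ftp/Specs/latest/",
    destination="./",
    rel=15,        # or "all"
    series=36,     # or "all"
    std="all",
)

Downloader(options).get_files()
```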
Getting the HTML content for each page in this case, including the final standards document file, is as simple as requests.get(<URL>).
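The page-fetching helper itself isn’t excerpted in this post, but given that description the get_index function referenced below presumably reduces to something like this; the body here is my assumption, and the real helper may handle errors differently.

```python
import requests


def get_index(index_url: str) -> str:
    # Fetch the HTML index page at the given URL and return its text content.
    response = requests.get(index_url)
    response.raise_for_status()
    return response.text
```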
Processing the page at each level of the site hierarchy is unique to that level, but each case amounts to “look for URLs satisfying the specified regular expression”, embodied by the get_urls and get_patterned_urls functions.
```python
def get_urls(this_soup: BeautifulSoup) -> typing.List[ParseResult]:
    this_urls = list()
    # Use the BeautifulSoup object to find the HTML "A" tags and extract the
    # "href" (URL) data.
    for link in this_soup.find_all("a"):
        this_urls.append(urlparse(link.get("href")))

    return this_urls


def get_patterned_urls(
    index_url: str, regex_pattern: str
) -> typing.List[UrlBasenameData]:
    """
    Recover URL base data from the specified pattern. Assumes there is an integer value specified
    in ``regex_pattern`` that must also be recovered for the URL "value".
    """
    index_soup = BeautifulSoup(get_index(index_url), "html.parser")
    page_urls = get_urls(index_soup)

    filtered_urls = list()
    for this_url in page_urls:
        this_basename = url_basename(this_url)
        match_result = re.match(regex_pattern, this_basename)
        if match_result is not None:
            this_value = int(match_result.group(1))
            this_result = UrlBasenameData(
                basename=this_basename, value=this_value, url=this_url
            )
            filtered_urls.append(this_result)

    return filtered_urls
```
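As an illustration, the release-level lookup presumably boils down to a call along these lines; the exact regular expression used by get_rel_urls isn’t shown in this post, so the pattern below is an assumption based on the “Rel-&lt;R&gt;” folder names.

```python
# Hypothetical call: recover the "Rel-<R>" subdirectory URLs from the top-level
# index, capturing each release number as the integer "value" field.
rel_url_data = get_patterned_urls(
    "https://www.3gpp.org/ftp/Specs/latest/", r"Rel-(\d+)"
)
for entry in rel_url_data:
    print(entry.basename, entry.value)
```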
Conclusion
This is a fairly simple utility that didn’t take very long to implement; the core implementation, including unit tests, took approximately two days of effort. The CI pipeline took a little longer because I was using some new tools and experimenting a bit with new Gitlab-CI features; more on that later…
Since this is a web scraping utility it is brittle to any changes 3GPP might make to the folder and file hierarchy, but it probably won’t be too difficult to adapt if and when that happens. There are also some obvious tweaks that would improve the utility now, such as
- accommodating the “archive” folder on the 3GPP site, where each document name is a folder rather than a zip file, and that folder contains all the historical revisions of the standard.
- specifying the “base” folder as a filter with default “latest”. The user could then just specify the timestamp they want (or “archive”) to acquire historical snapshots.