Announcing a Python CLI package that makes it easy to download individual 3GPP standards documents, or to bulk download whole sets of documents, as needed.

Periodically I have to use the 3GPP site to access the standards documents published there. The site is well structured and the standards are freely available (yay!); however, I’ve consistently found that navigating from the front page to the download link for the document you want can involve several false turns and many clicks. More recently I also just wanted to download “all the radio specifications”, and the only way to do that through the website is to manually click through every download link; tedious at best.

One solution is to somehow record all the download URLs and feed them to wget. Instead, I decided to apply a little knowledge of the download site’s structure and implement a utility that makes the downloads easier.

Announcing download_3gpp. It’s available via PyPI and provides a download_3gpp command that lets you filter which documents get downloaded. By default (no arguments) it downloads the latest copies of all releases, all series and all standards.

Here’s the output of the command line help to get you started:

> download_3gpp --help

   usage: download_3gpp [-h] [--base-url BASE_URL] [--destination DESTINATION]
                        [--rel REL] [--series SERIES] [--std STD]

   Acquire 3GPP standards packages from archive

   optional arguments:
     -h, --help            show this help message and exit
     --base-url BASE_URL   Base 3GPP download URL to target, default
                           "https://www.3gpp.org/ftp/Specs/latest/"
     --destination DESTINATION
                           Destination download directory, default "./"
     --rel REL             3GPP release number to target, default "all"
     --series SERIES       3GPP series number to target, default "all"
     --std STD             3GPP standard number to target, default "all"
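
For example, to fetch only the Release 15 documents, or just the 36 series documents from Release 15 into a specific directory (the release and series numbers here are purely illustrative):

> download_3gpp --rel 15
> download_3gpp --rel 15 --series 36 --destination ./specs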

So how does it work?

If you’re interested in what makes the utility work, I’ll explain that a little here. I’ll discuss the CI pipeline and release management for the project in another post.

The 3GPP download site has a basic structure that looks something like this, starting from the “latest” folder:

https://www.3gpp.org/ftp/Specs/latest/
├── readme.txt
└── <dir> Rel-<R>
    └── <dir> <S>_series
        ├── <S><nnn>-*.zip
        .
        .
        .
        └── <S><...>-*.zip

Here <R> is the 3GPP “Rel” number, <S> is the 3GPP “Series” number and <nnn> is the 3GPP “Std” number. Each series folder contains one or more standards documents (the zip files), depending on how many are defined for that series; this is where the main 3GPP site is helpful, since it explains what the various series and standards actually are.
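
For a known release and series the directory part of a document’s URL can be composed directly; only the zip file names, with their unpredictable version suffixes, need to be discovered by scraping the index pages. A minimal sketch of that composition (the helper and the example numbers are purely illustrative, not part of the package):

from urllib.parse import urljoin

def series_dir_url(base_url: str, rel_number: int, series_number: int) -> str:
    # e.g. series_dir_url("https://www.3gpp.org/ftp/Specs/latest/", 16, 36)
    #      -> "https://www.3gpp.org/ftp/Specs/latest/Rel-16/36_series/"
    return urljoin(base_url, "Rel-{0}/{1}_series/".format(rel_number, series_number))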

Two libraries are fundamental to the operation of the utility: requests and BeautifulSoup. The requests library acquires the HTML page content, and BeautifulSoup provides a clean way to pick apart that content, in particular the navigation and download URLs this utility is interested in.

The Downloader class and its get_files method are the key entry point for acquiring the files specified by the user. I’ve excerpted the code below, with some additional comments, to highlight the steps involved.

class Downloader:
    def __init__(self, user_options: UserOptions):
        self.__user_options = user_options

    def get_files(self):
        rel_url_data = get_rel_urls(
            self.__user_options.base_url, self.__user_options.rel
        )
        if not rel_url_data:
            # Issue a warning to the user that the "rel" data is empty. This is
            # not necessarily an error since if the user has changed the base
            # URL it is possible for some of those 3GPP snapshot folders to not
            # have any data in them.
            log.warning(...)

        # Iterate over the release ("rel" in 3GPP parlance) subdirectories
        for rel_basename, rel_number, rel_url in rel_url_data:
            this_series_url_data = get_series_urls(rel_url, self.__user_options.series)
            if not this_series_url_data:
                # Issue a warning to the user that the series URL data is empty.
                log.warning(...)

            for series_basename, series_number, series_url in this_series_url_data:
                std_url_data = get_std_urls(
                    series_url, series_number, self.__user_options.std
                )
                if not std_url_data:
                    # Issue a warning to the user that the std URL data is empty.
                    log.warning(...)

                for std_file, std_url in std_url_data:
                    # Formulate the local file path using `os.path.join`
                    local_std_path = ...

                    # Ensure the directory for the file exists, and all
                    # intermediate directories (similar to `mkdir -p`).
                    os.makedirs(os.path.dirname(local_std_path), exist_ok=True)

                    if not os.path.isfile(local_std_path):
                        # Save the remote file content into a local file.
                        log.info("Downloading file, {0}".format(std_url))
                        r = requests.get(std_url)
                        with open(local_std_path, "wb") as this_file:
                            this_file.write(r.content)
                    else:
                        # Warn the user that the file already exists locally
                        # and won't be downloaded again.
                        log.warning(...)
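
The command line front end presumably just parses the arguments into a UserOptions instance and hands it to Downloader, so usage boils down to something like the following (the keyword construction of UserOptions here is illustrative, not the package’s actual signature):

user_options = UserOptions(
    base_url="https://www.3gpp.org/ftp/Specs/latest/",
    destination="./",
    rel="all",
    series="all",
    std="all",
)

Downloader(user_options).get_files()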

Getting the HTML content for each index page, and ultimately the standards document zip file itself, is as simple as requests.get(<URL>).
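
The get_index helper used in get_patterned_urls below isn’t shown in the excerpts; a minimal sketch of what it amounts to, assuming that name and signature, is:

import requests

def get_index(index_url: str) -> str:
    # Fetch the raw HTML for a directory index page in the 3GPP hierarchy.
    response = requests.get(index_url)
    response.raise_for_status()
    return response.text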

Processing the page at each level of the site hierarchy is slightly different, but each one amounts to “look for URLs whose basename matches a specified regular expression”, embodied by the get_urls and get_patterned_urls functions.

def get_urls(this_soup: BeautifulSoup) -> typing.List[ParseResult]:
    this_urls = list()
    # Use the BeautifulSoup object to find the HTML "A" tags and extract the
    # "href" (URL) data.
    for link in this_soup.find_all("a"):
        this_urls.append(urlparse(link.get("href")))

    return this_urls

def get_patterned_urls(
    index_url: str, regex_pattern: str
) -> typing.List[UrlBasenameData]:
    """
    Recover URL base data from the specified pattern. Assumes there is an integer value specified
    in ``regex_pattern`` that must also be recovered for the URL "value".
    """
    index_soup = BeautifulSoup(get_index(index_url), "html.parser")
    page_urls = get_urls(index_soup)

    filtered_urls = list()
    for this_url in page_urls:
        this_basename = url_basename(this_url)
        match_result = re.match(regex_pattern, this_basename)

        if match_result is not None:
            this_value = int(match_result.group(1))

            this_result = UrlBasenameData(
                basename=this_basename, value=this_value, url=this_url
            )

            filtered_urls.append(this_result)

    return filtered_urls
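
Each level of the hierarchy then reduces to calling get_patterned_urls with a pattern matching the folder or file naming described earlier. The pattern below is illustrative rather than lifted from the package, but it shows the intent for the release level:

# Illustrative: recover the "Rel-<R>" folder URLs from the top level index page.
rel_url_data = get_patterned_urls(
    "https://www.3gpp.org/ftp/Specs/latest/", r"Rel-([0-9]+)"
)
for entry in rel_url_data:
    # entry.basename is e.g. "Rel-15", entry.value is the captured integer 15,
    # and entry.url is the parsed link for that folder.
    print(entry.basename, entry.value)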

Conclusion

A fairly simple utility that didn’t take very long to implement; the core implementation, including unit tests, took approximately two days of effort. The CI pipeline took a little longer because I was using some new tools and experimenting a bit with new GitLab CI features; more on that later…

Since this is a web scraping utility it is brittle to any changes 3GPP might make to the folder and file hierarchy, but it probably won’t be too difficult to adapt when that happens. There are also some obvious tweaks that would improve the utility, such as:

  • accommodating the “archive” folder on the 3GPP site, where each document name is a folder rather than a zip file, and the folder contains all the historical revisions of that standard.
  • making the “base” folder a filter that defaults to “latest”; the user could then specify the timestamped snapshot they want (or “archive”) to acquire historical snapshots.