A SPDX case study
At LinuxCon North America in New Orleans, Samsung's Young-taek Kim described his company's experience rolling out support for the Software Package Data Exchange (SPDX) standard in its product development tools. SPDX is a data format for tracking software components, licenses, and copyrights. The company was able to make its license-compliance work more efficient, but that was not the only benefit of the program. The implementation team also came away from the experience with suggestions for several ways to improve the SPDX specification itself.
Why SPDX
Kim is an engineer in Samsung's Open Source Initiative (OSI) team. Like the open-source groups inside many large corporations, the team is charged (in addition to its development duties) with educating and guiding other units in the company about open source principles. Kim gave a quick overview of SPDX before describing the OSI team's task and where SPDX fit into Samsung's workflow.
The SPDX specification is designed to produce a standardized "bill of materials" for an open source software package, he said. It communicates the licenses and copyrights that make up a package—including, importantly, packages that are derived from multiple sources. A constant problem in business scenarios is making sure that one's company gets good information about these factors from software suppliers and subcontractors. It is common, he said, for a supplier to say simply "this is open source" and provide no further information. The package could be MIT-licensed or under the GPL, but if one does not know which of those licenses it is, one does not know how to comply with it.
In practice, Kim said, he has often manually vetted a package by looking through the source. He does not mind this process, but it clearly results in duplication of effort when multiple project teams in multiple divisions repeat the vetting for a package that is already in use elsewhere. Standardizing the license and copyright information with SPDX lets the company create a central database to unambiguously keep track of the packages it has already vetted, and it helps resolve complex compliance questions that arise from combining multiple packages. Both benefits were of interest to Samsung.
Samsung's pilot program
Kim explained that Samsung wanted to reduce the overhead of license compliance, so it charged the OSI team with deploying SPDX data interchange in a pilot program inside the company. He then described Samsung's existing open source compliance process. The company breaks the process into four steps: discovering an open source package, developing a product with the package, verifying the obligations imposed by the open source license, and releasing the appropriate material to satisfy that obligation.
The SPDX pilot program was charged with improving those final two steps. Before the program, the verification stage meant confirming the license on a package by having a human read through the source, which is time-consuming, often redundant (such as when the same package has already been verified by a different product team), and prone to error. Human beings, he said, can reach different conclusions when reading the same code. The obligation-satisfaction stage was also largely manual (e.g., a person having to post source code on a public Samsung web site, make it available to customers, or insert a copyright statement onto a product screen) and could be expensive (especially when printing a source code offer in a user manual was involved—and even more expensive when re-printing is necessary).
The pilot program's first goal was to reduce the time lost to re-verification. The OSI team developed a tool called AIRS to identify software packages and verify their license and copyright in SPDX format. AIRS started out with a command-line interface, but is also usable as a Java library. It uses the Protex code verifier from Black Duck to scan a package and pick out license and copyright information. It then exports this information as SPDX data, including the licenses and copyrights of all components and (perhaps most importantly) the "concluded license" that applies to the combined work as a whole. It identifies files by SHA1 checksum, which helps catch duplicates—meaning that files which have already been scanned and analyzed once do not need to be re-scanned even when directory structures have been rearranged.
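The talk did not show how AIRS implements its checksum matching, so the following is only a minimal sketch of the general idea: hash each file's contents with SHA-1 and skip anything whose digest has been seen before. The function names and the `already_scanned` set are hypothetical stand-ins for whatever AIRS actually uses.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=65536):
    """Return the hex SHA-1 digest of a file's contents."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_needing_scan(root, already_scanned):
    """Return {relative_path: sha1} for files whose contents are not yet known.

    `already_scanned` is a set of SHA-1 hex digests from earlier scans
    (hypothetical; a real tool would load these from its database).
    """
    pending = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            checksum = sha1_of_file(full_path)
            if checksum not in already_scanned:
                pending[os.path.relpath(full_path, root)] = checksum
    return pending
```

Because the lookup key is the content digest rather than the path, a file that has merely been moved to a different directory is still recognized as already scanned.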
The eventual design is for AIRS to store this SPDX data in a central, company-wide database, which can then be queried whenever a new (or a duplicate) package is imported for testing. Right now, teams within the company exchange SPDX information internally using other tools. However, the chief benefit of AIRS is that it can identify the correct license and copyright of a package automatically. Even for a small development team, that demonstrably saves time.
The second goal of the pilot program was to simplify the obligation-satisfaction step, Kim said. For this, the OSI team developed a web tool (tied in to AIRS) that can automatically publish the appropriate license notice for a package on the company's web site. It generates the page for each package based on the stored SPDX data, and even generates a QR Code containing a link to the license page URL. Samsung intends to start putting these URLs on physical product packaging, perhaps as soon as October.
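Samsung's web tool was not shown in detail, but generating a notice page from stored license data can be sketched roughly as follows. The record fields, the page template, and the URL layout are all assumptions made for illustration; the QR Code step is left out because it would need a third-party library.

```python
from html import escape
from urllib.parse import quote

# Hypothetical page layout; the real tool's output format is not known.
PAGE_TEMPLATE = """<html>
<head><title>{name} {version}: open source notice</title></head>
<body>
<h1>{name} {version}</h1>
<p>Concluded license: {license}</p>
<p>{copyright}</p>
</body>
</html>
"""


def render_notice_page(record):
    """Render one package record as an HTML notice page.

    Every value is escaped so that text scanned out of source files
    cannot inject markup into the published page.
    """
    return PAGE_TEMPLATE.format(
        name=escape(record["name"]),
        version=escape(record["version"]),
        license=escape(record["concluded_license"]),
        copyright=escape(record["copyright"]),
    )


def notice_url(base_url, record):
    """Build the page URL that a QR Code would encode (hypothetical layout)."""
    page = "{}-{}.html".format(record["name"], record["version"])
    return "{}/{}".format(base_url.rstrip("/"), quote(page))
```

The string returned by `notice_url()` is what would be handed to a QR Code generator and printed on the packaging.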
SPDX in the future
Overall, the company was quite happy with the pilot program, Kim said, so work is continuing. The AIRS centralized SPDX database is the first order of business, but there are several other to-do list items. One is support for verification engines other than Protex; another is the ability to identify the same code snippet even when the file checksum changes. The OSI team also wrote its own SPDX parser when developing AIRS, which Kim said he hopes to release as an open source project in its own right.
In reply to an audience question, Kim said that the company may start requiring external software suppliers to provide SPDX data on the packages that they supply. What makes that request tricky is that Samsung is still responsible for verifying that the information is correct, so it will probably have to use AIRS to process the suppliers' code anyway.
Despite its general satisfaction, Samsung ran into several problems with SPDX itself when running its pilot program. First was the "Artifact of Project" property (defined in sections 6.8 to 6.10 of the SPDX specification [PDF]), which is meant to indicate that the file in question belongs to a specific project. In the specification, the cardinality of this property is "one," so a given file can only be associated with a single project. Samsung found that insufficient to record projects that constitute combined works, and had to modify its SPDX output to list every project that a file belongs to.
The property also requires parent projects to be described with the Description of a Project (DOAP) format, which duplicates the same RDF/XML data for every file in a project; a simple database reference would save space. In addition, Kim said, Samsung found it problematic that SPDX does not account for sub-projects within a project, which is a common situation when creating large products. It also ran into problems caused by the fact that SPDX does not enforce a common rule for the formatting of file paths; packages can reference files with relative path names, which makes it difficult to match them up for the purpose of determining the concluded license. Requiring that file paths be normalized would simplify things.
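To illustrate the path-matching problem, here is a minimal sketch of one possible normalization rule. The rule itself (forward slashes, no redundant `.` or `..` segments) is an assumption for the sake of the example, not something the SPDX specification or Kim's talk prescribes.

```python
import posixpath


def normalize_file_path(path):
    """Reduce a file path to a canonical forward-slash form.

    Backslashes become slashes, and redundant separators, '.' segments,
    and resolvable '..' segments are collapsed.
    """
    return posixpath.normpath(path.replace("\\", "/"))


def same_file(path_a, path_b):
    """Report whether two differently written paths name the same file."""
    return normalize_file_path(path_a) == normalize_file_path(path_b)
```

With a rule like this, `./src/../src/main.c` and `src//main.c` both reduce to `src/main.c`, so their license records can be compared directly.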
SPDX is often touted for its ability to ensure correctness in license-compliance efforts, so it is interesting to see that it can enable other benefits, too, such as reducing the amount of duplicated work undertaken by developers. Samsung is an enormous company, so even saving a small amount of time on a per-project basis can add up to a lot.
[The author would like to thank the Linux Foundation for
assistance with travel to New Orleans.]
Index entries for this article | |
---|---|
Conference | LinuxCon North America/2013 |
Posted Sep 26, 2013 16:50 UTC (Thu)
by dvdeug (guest, #10998)
Posted Sep 26, 2013 18:48 UTC (Thu)
by HIGHGuY (subscriber, #62277)
For our team, using the tool at first generated a lot of work.
For any large company dealing with OSS products, such automation is golden.
There are additional benefits to be found in maintaining such a database, like providing teams with security alerts or merely forming an internal community around certain packages, promoting reuse and stimulating communication.
Posted Sep 30, 2013 16:05 UTC (Mon)
by dps (guest, #5725)
Sometimes we either don't have the source code or can't get it under a licence that would allow us to redistribute the source code. Some of the other code is hazardous to mental health or part of our secret sauce (and sometimes both).
Our company also has a large database that one uploads source code into for license review and that aids the process of verification and license obligation compliance.
Currently we source 2 different Linux distributions and maintaining all of this information throughout regular package updates and other maintenance is quite a burden.
That is why we further automated the process for our team, going from 2-3 manweeks of work per release, down to 2-3 mandays of work per release. Further tuning can probably bring this down to 1-2 mandays.