Prioritization is the process we used for deciding in which order to publish our high-value public datasets to the portal. Our inventory included 136 datasets to choose from, and 44 of those were on the portal at launch. This chapter describes how we built our inventory and how we prioritized the results.
It is important to note that our inventory and portal prioritization processes are fluid. Although we describe the various steps below sequentially, some steps can occur, and did occur, concurrently, depending on the data, as we became more comfortable with our process.
The Open Data Policy places great emphasis on publishing an inventory of high-value public datasets and ensuring all datasets in the inventory are available on a public portal by 2020. This inventory is published at datasd.org, the official information site for the City’s Data & Analytics program. The initial publication of the inventory on February 29, 2016, included 115 high-value public datasets. Today, our inventory contains 136 datasets.
As we improve our methods for locating and accessing data across the City, the inventory will grow. Initially, we used three major sources for the inventory: the raw results from the first City-wide manual survey conducted in 2015; our discovery meetings with information coordinators and data stewards about their data sources; and reports, web pages and conversations with City employees. In the future, we will be working to automate methods for finding data.
The manual dataset survey
During the six months leading up to the last Open Data Policy implementation update, the City’s newly hired Chief Data Officer designed and conducted a first-ever manual survey of City data. This massive effort involved the contributions of 65 information coordinators in 35 departments and programs.
The survey accomplished a number of objectives:
Acquire a base-level understanding of the data the City has at its disposal
Create and collect basic attributes describing this data
Establish a competent group of information coordinators across the departments
Identify people currently working with and managing data as well as their experience and knowledge of data tools and best practices
Understand which of the City’s software systems are authoritative for various sets of information
Identify critical data and be able to prioritize it for release or internal access
Preliminary returns from the survey did not immediately become the official inventory of high-value public datasets. The survey's purpose was much broader, as outlined above, and its results served only as a starting point for the creation of the official inventory. For more background on the inventory effort, see the first Open Data Implementation update report published July 1, 2015.
Making sense out of the survey submissions
No one person in the City is familiar with the specialized knowledge of all 65 information coordinators who contributed to the survey. Our team faced a big task in reviewing and understanding each submission, especially because submissions varied in their completeness, and some were described in language that is not meaningful to anyone outside of the department that uses it every day.
Additionally, many entries did not meet the definition of dataset. We cast a wide net for potential datasets because we preferred to evaluate whether something was a dataset rather than miss a potential dataset. We have the ability to revisit information we obtained for further analysis and consideration, but much of what was submitted did not meet the criteria for inclusion in the official inventory.
Public versus private
The most important criterion for whether the City would include a dataset in the inventory, or ultimately the portal, is whether it meets the definition of “public.” The Open Data Policy identifies several circumstances under which a dataset is not public and thus generally not subject to the policy. These include exemptions under federal, state or local law, rule or regulation; information that describes employees or the internal deliberative processes of the City; proprietary software, computer code and applications; and data that exists in a location that is assigned to a single City employee or a single City-owned or City-controlled computing device. Datasets that meet at least one of these criteria are not subject to inclusion in the inventory or the portal.
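The "public" test described above can be pictured as a simple any-one-disqualifies check. The sketch below is only an illustration of that logic, not a tool the City uses; the class name, field names and function name are all hypothetical, and the real determination involves legal review rather than four flags.

```python
from dataclasses import dataclass


@dataclass
class CandidateDataset:
    """Hypothetical record for one survey entry; field names are assumptions."""
    name: str
    legally_exempt: bool = False                     # federal, state or local law, rule or regulation
    describes_employees_or_deliberations: bool = False
    proprietary_software_or_code: bool = False
    held_by_single_employee_or_device: bool = False


def is_public(candidate: CandidateDataset) -> bool:
    """Return True only if none of the four exclusion circumstances applies."""
    exclusions = (
        candidate.legally_exempt,
        candidate.describes_employees_or_deliberations,
        candidate.proprietary_software_or_code,
        candidate.held_by_single_employee_or_device,
    )
    # Meeting at least one exclusion is enough to keep the dataset out.
    return not any(exclusions)
```

A candidate with no flags set passes the check, while setting any single flag is enough to exclude it from the inventory and the portal.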
Another criterion for determining which datasets are subject to publication is whether a public dataset is high-value. No single person can judge the ultimate value of a dataset, so as we review potential datasets for value, we take into consideration the points of view of several personas: a software developer, a community activist, a resident, the Mayor and Council members, a department director and a City staff person, such as a traffic engineer. Considering these personas, a dataset might be the basis for a web application; it could drive decisions about policy or operations; or it could satisfy curiosity about major issues facing residents.
Another way to determine value is to look at how often the public requests the data. If a highly requested dataset is available on an Open Data Portal, City staff will spend less time filling requests for that dataset. At the time of our initial inventory, the City was not centrally evaluating California Public Records Act (CPRA) requests for the purposes of the open data program. As a result, we asked information coordinators to flag datasets frequently requested under the CPRA. To further increase our understanding of datasets requested under the CPRA, and inform our inventory and publication efforts, Performance & Analytics has worked with the Human Resources Department on an online public record request tool at sandiego.nextrequest.com. Going forward, not only will this tool help the public efficiently request and search public records, but it will also allow us to better identify frequently requested datasets to add to our inventory, our portal, or both.
The final factor helping us determine value of datasets, particularly for our initial portal launch, is the voting option available to everyone - the public, other City employees and decision-makers - at datasd.org. Users can page through the inventory and submit a vote for each dataset that they are interested in accessing on the Open Data Portal. The vote results helped determine which datasets were published first. We will continue providing this option and also plan to include in the inventory datasets we did not initially consider high-value so the public has an opportunity to provide further input.
Prioritizing the inventory
Once we had identified our initial inventory of high-value public datasets, we screened the list using the various factors below in order to decide what we could publish to the portal first. The inventory process generated basic information related to these factors, but a thorough review required gathering additional information from information coordinators, data stewards, department directors, the City’s legal team and the City’s Chief Information Security Officer.
If a dataset has the potential to expose the City or residents to security risks, the City will not include it in the inventory or the portal. City attorneys and the City’s Chief Information Security Officer weigh in on this decision.
The City is also committed to protecting any private information about employees and residents. If a dataset includes private information that would need to be removed or aggregated before publication, it may appear in the inventory but could take longer to publish to the portal to allow for extra work and processing time.
Information coordinators were asked to flag datasets for which quality might be a concern. If we determined that the quality could improve, for example by removing older data, the dataset may appear in the inventory, but take longer to publish to the portal. When possible, we offer advice and help to improve data collection methods.
The City has high standards for the data that goes on the portal. Readiness refers to the amount of work it would take to get the dataset onto the Open Data Portal using no human intervention. Several characteristics factor into this, so extracting a dataset from a primary source and delivering it to the portal could be a very involved process.
If a dataset is already published somewhere on the City website and can be downloaded in a machine-readable format, we work with the coordinators to connect directly to the source system and move that data onto the portal, which eases the burden of publishing the dataset. Often, data is extracted and manipulated by hand before online publication, and we want to take that time-consuming process off the data owner’s hands. Many datasets need an automated process for connecting, extracting, cleaning and uploading to the portal. Those datasets are not readily available, which means that they may appear in the inventory but will take longer to publish to the portal.
The 44 datasets at the top of the list after prioritization were published to the portal at launch, and several more are at some stage in the publication process. By 2020, all high-value public datasets must either be published to the portal or listed in a compliance plan along with the reasons they could not be published and timelines for eventual publication.
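The screening factors in this section (security, privacy, quality, readiness and public votes) can be combined into a rough publication order. The sketch below is one possible way to do that, under loud assumptions: this chapter does not say how the factors were weighted, so the rule used here (exclude security risks, then put datasets with fewer delaying factors first, then more votes first) is illustrative only, and every class, field and function name is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InventoryEntry:
    """Hypothetical inventory record; field names are assumptions."""
    name: str
    votes: int = 0                    # votes cast through the inventory page
    security_risk: bool = False       # excluded outright if True
    needs_privacy_work: bool = False  # private data must be removed or aggregated
    quality_concern: bool = False     # flagged by an information coordinator
    readily_available: bool = True    # machine-readable and connectable today


def publication_order(entries: List[InventoryEntry]) -> List[InventoryEntry]:
    """Order entries for publication under the assumed rule described above."""
    # Datasets that pose a security risk never reach the inventory or portal.
    eligible = [e for e in entries if not e.security_risk]

    def sort_key(entry: InventoryEntry):
        # Each of these factors means extra work before publication.
        delays = sum((
            entry.needs_privacy_work,
            entry.quality_concern,
            not entry.readily_available,
        ))
        # Fewer delays first, then more votes, then name for a stable tie-break.
        return (delays, -entry.votes, entry.name)

    return sorted(eligible, key=sort_key)
```

Under this rule, a readily available dataset with no privacy or quality flags is scheduled ahead of a more popular dataset that still needs processing. A different weighting, such as letting votes outrank readiness, would only require changing the sort key.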