Empowering Public Organizations: Preparing Your Data for the AI Era

Community Article Published April 10, 2025

Overview

Public organizations, including government agencies, libraries, nonprofits, and statistical bureaus, are increasingly recognizing the importance of preparing their data for artificial intelligence. While a large amount of public data remains locked in formats unsuitable for machine learning, institutions are starting to transform this data into machine-readable forms that support AI applications. These efforts demonstrate a global shift toward making public data AI-ready, empowering organizations to enhance community impact, support inclusive AI development, and multiply the value of their data through collaboration. This guide outlines how public organizations can follow suit, with practical strategies for turning valuable but underutilized data into a foundation for public-interest AI.

Example: Aerial imagery of Massachusetts in 2023 obtained from MassGIS, converted to a Hugging Face dataset


1. Introduction: Why Public Data Matters More Than Ever

Across sectors and borders, public organizations—from government agencies and statistical bureaus to libraries and nonprofits—are awakening to a critical reality: their data holds immense potential to fuel artificial intelligence, but only if it's prepared in machine-readable, accessible formats.

Public organizations are authoritative sources for critical information: monitoring environmental conditions, tracking educational outcomes, documenting workforce trends, preserving cultural heritage, and managing public infrastructure. However, much of this data exists in formats that AI systems can't easily use — stored in PDFs, scattered across Excel files with inconsistent structures, and more often than not, organized in specialized formats designed for human consumption rather than machine learning. In fact, it is estimated that only about 20% of organizations have data strategies mature enough to take full advantage of AI tools. This limits the ability of AI systems to harness those resources and, in turn, restricts innovation, service improvement, and broader social impact.

Forward-looking organizations and agencies have been leading a quiet transformation, formatting and publicly sharing their data to enable AI-driven tools that reflect public needs and values.

Making public data AI-ready not only advances institutional missions but also ensures that emerging technologies are built on diverse, representative, and trustworthy information. By preparing data for AI, public organizations can:

  • Enable technology that better serves communities, such as personalized learning tools built from standardized testing data.
  • Amplify the value of public data through collaboration, allowing civic technologists and researchers to create new applications without additional organizational burden.
  • Maintain principled control over data use, establishing licensing and access standards that align with public-interest goals.

This guide explores how diverse public organizations are rising to the challenge and offers concrete examples and practical strategies for making data AI-ready via case studies on data from the Massachusetts Data Hub.

Many thanks to Daniel Van Strien for the feedback on this article, and for his rich history of work and expertise in this area that has made dataset work like this possible on the Hugging Face Hub.

2. Public Organizations on Hugging Face Hub

Many government agencies and public organizations are already leading the way in preparing their data for machine learning applications and sharing it on Hugging Face. Here are a few examples:

National Library of Norway (Nasjonalbiblioteket) AI Lab

The National Library of Norway has established itself as a powerhouse on Hugging Face, with over 150 models, 21 datasets, and multiple machine learning demos shared publicly. Their contributions include Norwegian language models trained on texts from their national library collections and Whisper (speech-to-text) models fine-tuned for Sámi languages.

These resources have democratized access to AI for Norwegian citizens, researchers, and businesses, enabling applications ranging from automated transcription services to natural language processing tools specifically optimized for Norwegian cultural context.

NASA

NASA, jointly with IBM, has shared multiple datasets on Hugging Face, including satellite imagery, environmental monitoring data, and a geospatial foundation model. The NASA CISTO Data Science Group has also released satellite vision data and interactive Spaces on Hugging Face to aid research. This openly available geospatial data has already powered AI applications: for example, researchers have used these datasets to develop models that can quickly assess damage after natural disasters, helping emergency responders prioritize their efforts.

BigLAM Initiative

Developed out of the BigScience project, the BigLAM initiative focuses on making datasets from Galleries, Libraries, Archives, and Museums (GLAM) more accessible for machine learning. They've made over 30 GLAM-related datasets available via the Hugging Face Hub, providing a rich resource for cultural and historical machine learning applications.

These datasets are particularly valuable for researchers working on computer vision models that can recognize cultural artifacts, historical document analysis, and automated metadata generation for archival materials.

One notable example is the European Art Object Detection Dataset (DEArt), which contains 15,000+ paintings from the 12th-18th centuries with 112,000+ annotated objects across 69 categories. This dataset has already empowered researchers to build specialized computer vision models for art analysis, as demonstrated in the YOLOv11 model results shown below:

YOLOv11 object detection results on paintings from the DEArt dataset.

This is a non-exhaustive list; many other public organizations, such as the Ministry of Culture of France, the National Archives of Finland, and the Ministry of Digital Affairs of Poland, are contributing datasets and models to public knowledge on the Hub.

3. Tutorial on preparing public data for machine learning with the Massachusetts Data Hub

The Massachusetts Data Hub is an official platform that connects users to data and reports published by Massachusetts state agencies. Launched in 2022, the Data Hub serves as a centralized gateway to hundreds of datasets, allowing users to search for data by topic or keyword without needing to know which specific agency holds the information. It features resources across various domains including business, education, health, environment, transportation, and more, making government data more accessible to the public.

For this case study, we selected four different types of government data from Massachusetts that represent common challenges in public data accessibility. Each of these datasets contains valuable information but was published in formats that created barriers to machine learning applications. By adapting these datasets, we aim to showcase practical approaches that other public organizations can adopt for their own data.

Education Assessment Data (MCAS)

Contains student performance metrics across all Massachusetts public schools from the MCAS Performance Results portal, including achievement levels in different subjects, growth percentiles, and demographic breakdowns.

Original Format:

Excel spreadsheets with inconsistent formatting across years (2017-2024)

Challenges:

  • Header rows in different positions across files
  • Inconsistent column naming conventions
  • Different data types (numbers stored as text in some files)
  • Achievement level definitions changing over time

Our Approach:

Although the data was already available in structured tables that could be exported as Excel files, the inconsistent conventions required careful handling. We developed a Python script that automatically identified the correct header row, enforced consistent data types, and found the columns common to all files:

import pandas as pd

# Define string columns that should never be converted to numeric
string_columns = ['District Name', 'District Code', 'Subject', 'Year']
for col in combined_df.columns:
    if col not in string_columns:
        # Convert all other columns to numeric (invalid values become NaN)
        combined_df[col] = pd.to_numeric(combined_df[col], errors='coerce')

The script also handled the varying header positions by scanning for key column identifiers:

# Example of how we handled inconsistent header rows in Excel files
for i in range(10):
    if 'DISTRICT NAME' in str(sample_df.iloc[i].values).upper():
        header_row = i
        break
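For completeness, here is a hedged sketch of how the per-year files could then be read and merged on their shared columns. The file names and glob pattern are illustrative assumptions, not the exact notebook code:

import glob
import pandas as pd

# Hypothetical list of yearly MCAS export files (adjust the pattern to your downloads)
excel_files = sorted(glob.glob("mcas_*.xlsx"))

frames = []
for path in excel_files:
    # Re-use the header-row detection shown above for each file
    sample_df = pd.read_excel(path, header=None, nrows=10)
    header_row = next(
        i for i in range(len(sample_df))
        if 'DISTRICT NAME' in str(sample_df.iloc[i].values).upper()
    )
    frames.append(pd.read_excel(path, header=header_row))

# Keep only the columns present in every year, then stack the years together
common_cols = set.intersection(*(set(f.columns) for f in frames))
combined_df = pd.concat([f[list(common_cols)] for f in frames], ignore_index=True)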

Result and Code:

A consolidated dataset with 6,741 rows covering seven years of assessment data, ready for longitudinal analysis.

The Python notebook to create this dataset can be found here.

What you can do with this data:

With this education assessment dataset, you can build predictive models to identify educational trends, create interactive dashboards for policymakers, or train models to recommend targeted interventions. The Hugging Face Hub offers tools like AutoTrain for no-code machine learning on this structured data, Gradio for building interactive visualizations, and Inference Endpoints for deploying analytics applications. This dataset could help education departments better understand performance patterns and allocate resources more effectively.
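As one concrete illustration, here is a minimal Gradio sketch for an interactive trend viewer. The repository id and the metric column name are placeholders, and it assumes the consolidated dataset has 'District Name' and 'Year' columns:

import gradio as gr
import matplotlib.pyplot as plt
from datasets import load_dataset

# Placeholder repo id; replace with the actual dataset on the Hub
df = load_dataset("your-username/mcas-performance", split="train").to_pandas()

def plot_district(district):
    # Plot one (illustrative) metric over the years for the selected district
    sub = df[df["District Name"] == district].sort_values("Year")
    fig, ax = plt.subplots()
    ax.plot(sub["Year"], sub["Meeting or Exceeding Expectations %"], marker="o")
    ax.set_title(district)
    ax.set_xlabel("Year")
    return fig

demo = gr.Interface(
    fn=plot_district,
    inputs=gr.Dropdown(sorted(df["District Name"].unique().tolist())),
    outputs=gr.Plot(),
)
demo.launch()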

Labor Market Information: Regional Workforce Data

Provides detailed statistics on workforce trends, priority industries, and occupations across seven different regions of Massachusetts from the 2019 Regional Labor Market Data on Mass.gov.

Original Format:

PowerPoint files on Mass.gov that were downloaded and converted to PDF format

Challenges:

  • Data originally in PowerPoint presentations, then converted to PDFs
  • Tables mixed with narrative text
  • Inconsistent layouts across regions
  • No machine-readable source provided

Our Approach:

We leveraged the SmolDocling model, a multimodal Image-Text-to-Text model designed for efficient document conversion. This model has OCR capabilities that allowed us to extract structured content from the PDFs converted from PowerPoint files:

# Example of how we processed PDFs with SmolDocling
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load the SmolDocling model
processor = AutoProcessor.from_pretrained("ds4sd/SmolDocling-256M-preview")
model = AutoModelForVision2Seq.from_pretrained("ds4sd/SmolDocling-256M-preview")

# Process a PDF page
def extract_from_pdf_page(page_image):
    inputs = processor(images=page_image, text="Convert this page to docling.", return_tensors="pt")
    outputs = model.generate(**inputs, max_length=1024)
    return processor.decode(outputs[0], skip_special_tokens=True)

The SmolDocling model converted the documents to text while preserving formatting elements, tables, and other components of the original documents.

Result and Code:

A structured dataset capturing regional labor market statistics for seven regions of Massachusetts that maintains the context and organization of the original reports.

The Python notebook to create this dataset can be found here.

What you can do with this data:

This labor market dataset enables workforce development agencies to leverage AI for economic planning and industry targeting. You can use Hugging Face's LangChain integration to build RAG applications that answer complex queries about regional workforce trends, or train classification models with SetFit to identify emerging industry clusters from minimal training data. This machine-readable format significantly reduces the time required for regional economic analysis and helps workforce boards make data-driven decisions that better serve both job seekers and employers.

Sentiment analysis and forward-looking analysis are powerful tools that can be run on this type of data to analyze macro trends and quickly paint a picture from the reports. Here is a Space on Hugging Face that does this with financial statement data:

A Hugging Face Space running sentiment analysis on financial statement data.
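In the same spirit, a minimal sketch of running sentiment analysis over extracted report passages with the transformers pipeline (the default model is used here, and the example sentences are invented for illustration):

from transformers import pipeline

# Default sentiment model; a finance- or policy-tuned model could be swapped in
classifier = pipeline("sentiment-analysis")

passages = [
    "Healthcare and social assistance employment in the region is projected to grow.",
    "Manufacturing employment in the region declined over the reporting period.",
]
for passage, result in zip(passages, classifier(passages)):
    print(f"{result['label']} ({result['score']:.2f}): {passage}")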

Occupational Safety and Health Statistics

Data from the Massachusetts Occupational Safety and Health Statistics Program containing valuable workplace safety information, including injuries by industry, occupation, and demographic data.

Original Format:

PDF reports published on Mass.gov

Challenges:

  • Information locked in PDF format
  • Mix of tables, charts, and narrative text
  • Complex multi-column layouts
  • No structured data exports available

Our Approach:

Since these were already in PDF format, we directly used the OCR and document structure recognition capabilities of the SmolDocling model:

# Example of structured extraction from PDFs
# (re-uses the SmolDocling processor and model loaded in the previous example)
def process_safety_data_pdf(pdf_path):
    # Convert PDF pages to images (helper function, e.g. built on pdf2image)
    images = convert_pdf_to_images(pdf_path)
    
    results = []
    for img in images:
        # Use SmolDocling to extract structured content
        inputs = processor(images=img, text="Convert this page to docling.", return_tensors="pt")
        outputs = model.generate(**inputs, max_length=1024)
        page_content = processor.decode(outputs[0], skip_special_tokens=True)
        results.append(page_content)
    
    # Combine and post-process the DocTags output into structured text (helper function)
    return process_doc_tags(results)

Result and Code:

A structured dataset containing workplace safety statistics that maintains the rich context of the original reports while making the data accessible for machine learning applications.

The Python notebook to create this dataset can be found here. Note that this is the same code as in the Labor Market example; once the PowerPoint files were converted to PDF, the rest of the process is identical.

What you can do with this data:

This occupational safety dataset opens up possibilities for risk prediction and prevention strategies. Using the Hugging Face Hub, safety agencies can develop PEFT-tuned language models that extract insights from complex safety reports, create industry-specific risk classifiers with Sentence Transformers, or build interactive safety dashboards using Gradio or Streamlit on Hugging Face Spaces. The structured format enables agencies to move beyond retrospective analysis to proactive risk identification, as is being proposed in Florida, potentially helping prevent workplace injuries before they occur.
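As a hedged sketch of the Sentence Transformers idea, one could embed short incident descriptions and train a lightweight classifier on top. The category labels and example texts below are invented purely for illustration:

from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Embed short incident descriptions with a small general-purpose model
encoder = SentenceTransformer("all-MiniLM-L6-v2")

texts = [
    "Worker fell from ladder while repairing roof",
    "Repetitive strain injury reported at packaging line",
    "Chemical splash during tank cleaning",
    "Back injury while lifting boxes in warehouse",
]
labels = ["falls", "ergonomic", "chemical", "ergonomic"]  # illustrative categories

# Train a simple classifier on the embeddings and score a new report
clf = LogisticRegression(max_iter=1000).fit(encoder.encode(texts), labels)
print(clf.predict(encoder.encode(["Slipped on wet floor near loading dock"])))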

Geospatial Imagery: Aerial Photography

High-resolution "leaf off" aerial photographs from the MassGIS 2023 Aerial Imagery database, covering the entire state of Massachusetts, captured in spring 2023. These images support environmental monitoring, urban planning, infrastructure management, and geospatial analysis applications.

Original Format:

JP2 (JPEG 2000) files in an irregular tile structure

Challenges:

  • Specialized JP2 format not supported by many AI vision systems
  • Large file sizes (average 19MB per tile)
  • Images distributed as ZIP files requiring extraction
  • Complex geographic indexing system (US National Grid coordinates)

Our Approach:

We created a pipeline to download, extract, and convert the images, preserving geographic metadata. File handling was particularly challenging:

import shutil

# Handle potential encoding issues in filenames when pulling JP2 tiles out of ZIP archives
def safe_extract(zip_ref, jp2_file, target_path):
    try:
        with zip_ref.open(jp2_file) as source, open(target_path, 'wb') as target:
            shutil.copyfileobj(source, target)
        return True
    except UnicodeEncodeError:
        # Try an alternative encoding for the filename (cp437 is the ZIP default)
        jp2_file_encoded = jp2_file.encode('cp437').decode('utf-8')
        with zip_ref.open(jp2_file_encoded) as source, open(target_path, 'wb') as target:
            shutil.copyfileobj(source, target)
        return True
    except Exception as e:
        print(f"Error extracting {jp2_file}: {e}")
        return False
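The format conversion step itself is not shown above. A minimal sketch using Pillow, which can read JPEG 2000 when built with OpenJPEG support, might look like the following; rasterio or GDAL would be better choices when the georeferencing metadata must be carried along, and the file paths are illustrative:

import os
from PIL import Image

def convert_jp2_to_png(jp2_path, output_dir):
    # Read the JP2 tile and write a PNG copy with the same base name
    image = Image.open(jp2_path)
    out_path = os.path.join(
        output_dir, os.path.splitext(os.path.basename(jp2_path))[0] + ".png"
    )
    image.save(out_path)
    return out_path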

Additionally, we had to create a sample subsection of the dataset to enable proper previewing in the Hugging Face dataset viewer, as the full-resolution images were too large:

# Create a sample dataset with resized images for preview
df = pd.DataFrame({
    'file_path': [os.path.join(train_dir, row['file_name']) for _, row in metadata_df.iterrows()],
    'tilename': metadata_df['tilename'].tolist(),
    'zone': metadata_df['zone'].astype('int64').tolist()
})

# Open the image files into an 'image' column, then resize them to a manageable
# size for preview (assumes the same pandas `.pil` image accessor used for the
# resize call also provides an `open` method)
df['image'] = df['file_path'].pil.open()
df['image'] = df['image'].pil.resize((256, 256))

# Save to Parquet for efficient storage and retrieval
df = df[['image', 'tilename', 'zone']]
df.to_parquet(output_path)

Result and Code:

A collection of standardized aerial imagery ready for computer vision and machine learning applications, with a sample preview that makes the dataset more accessible to developers and researchers.

The Python notebook to create this dataset can be found here.

What you can do with this data:


A Hugging Face space to convert aerial images to maps.

This geospatial dataset enables powerful environmental monitoring and urban planning applications. On the Hugging Face Hub, you can leverage pre-trained vision models like SegFormer or YOLOS for land use classification, build custom object detection models with MMDetection to identify infrastructure elements, query the imagery with open vision-language models like SmolVLM2, or use Segment Anything to automatically detect environmental changes. State agencies could develop applications that track urban sprawl, monitor deforestation, assess flood risks, or identify potential solar panel installation sites, all from data they already own.
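For instance, here is a minimal sketch of running a pre-trained semantic segmentation model over one of the converted tiles. The model choice and file name are assumptions, and ADE20K classes are only a rough fit for aerial imagery, so a model fine-tuned on remote-sensing data would usually be preferable:

from transformers import pipeline
from PIL import Image

# SegFormer fine-tuned on ADE20K, used here purely as an illustration
segmenter = pipeline("image-segmentation", model="nvidia/segformer-b0-finetuned-ade-512-512")

tile = Image.open("sample_tile.png")  # a converted aerial tile (placeholder path)
for segment in segmenter(tile):
    # Each result carries a label, an optional confidence score, and a PIL mask
    print(segment["label"], segment.get("score"))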

4. Step-by-Step Guide to Sharing on Hugging Face

Once you've prepared your data for machine learning applications, the next step is sharing it on platforms where AI developers can easily find and use it. The Hugging Face Hub has become a central repository for machine learning datasets and offers several ways to upload your data.

Upload via GUI (Web Interface)

For smaller datasets or initial exploration, the web interface provides a straightforward way to share your data:

  1. Log into your Hugging Face account
  2. Click on your profile → "New Dataset"
  3. Choose a name, visibility settings, and license
  4. Upload files and create documentation

Screenshot of the dataset upload interface.

Upload via Code

For larger datasets or automated workflows, you may upload via code:

from datasets import Dataset
import pandas as pd

# Load your data
df = pd.read_csv("your_processed_data.csv")

# Convert to Hugging Face dataset
dataset = Dataset.from_pandas(df)

# Push to Hugging Face Hub
dataset.push_to_hub("your-username/your-dataset-name")
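For very large files, such as the aerial imagery tiles above, the huggingface_hub client can also upload an entire folder to a dataset repository. The repository id and folder path below are placeholders:

from huggingface_hub import HfApi

api = HfApi()
# Create the dataset repo if it does not exist yet, then upload the folder contents
api.create_repo("your-username/your-dataset-name", repo_type="dataset", exist_ok=True)
api.upload_folder(
    folder_path="path/to/processed_data",
    repo_id="your-username/your-dataset-name",
    repo_type="dataset",
)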

A more detailed discussion of the Hugging Face Hub's dataset hosting features can be found here, and the full documentation for uploading datasets can be found here.

5. Technical Takeaways

Here are some key takeaways for organizations looking to make their data more accessible for AI systems:

Identify Your Most Valuable Datasets

Before diving into technical implementations, assess your current data assets:

  • Which datasets would provide the most value to your organization if they were more accessible?
  • What unique information does your organization hold that could power innovation?
  • Which datasets are currently most difficult for your team and external users to work with?
  • Which of your existing datasets already have clear licensing terms that would support AI use?

Prioritize datasets with high potential impact and those with technical barriers to access. This initial assessment will help focus your efforts where they will have the greatest return on investment.

Determine Format Needs

Different use cases require different formats. Consider:

  • Who will be using this data and what are their technical capabilities?
  • What formats would make the data most accessible for your intended audience?
  • What level of processing is needed to make the raw data usable?

By understanding the needs of your data users, you can choose formats that maximize accessibility while minimizing conversion effort.

Document Clearly

Comprehensive documentation is crucial for ensuring your data can be used effectively:

  • Include clear provenance information about where the data came from
  • Explain any transformations or processing applied to the original data
  • Document known limitations, gaps, or quality issues
  • Provide examples of how to load and use the data
  • Clearly specify licensing terms for your data

Good documentation not only helps users understand your data but also builds trust in its quality and appropriateness for different applications. Well-documented data is also consistently the most downloaded: 90% of dataset downloads on the Hugging Face Hub come from fully documented datasets.
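As an example of the "provide examples of how to load and use the data" point above, a dataset card can include a snippet as simple as the following, with a placeholder repository id:

from datasets import load_dataset

# Load the dataset directly from the Hub and inspect the first record
dataset = load_dataset("your-username/your-dataset-name", split="train")
print(dataset[0])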


By preparing your organization's data for machine learning applications, you are creating opportunities for innovation and amplifying the impact of your existing data assets. With targeted effort, public organizations can adapt their valuable information resources into formats that support innovation, research, and public service aligned with their missions. We can't wait to see your datasets on Hugging Face 🤗!
