Executive Summary

Addressing Nodes Without Genotypes

Integrating clinical, molecular, and social contact data at statewide and national scales (100k+ cases) without degrading web browser performance.

80%+

Memory Footprint Saved

1M

Est. Record Capacity

The Challenge

The Genotype Gap

1 Visual Mapping Rules

Circular nodes represent TNA-eligible sequenced cases. These are filled with a solid color to indicate sequence availability and eligibility.
Diamond nodes represent ineligible sequenced cases. These are filled with a solid color, distinguishing cases that have sequences but fail quality or length criteria.
Square nodes represent unsequenced cases. These are rendered as hollow with a dashed border to denote the absence of a genetic sequence.
Dashed lines represent social or epidemiological contact links, whereas solid lines denote sequence-based molecular distance links.
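
The mapping rules above amount to a small lookup from case status and link type to a drawing style. A minimal sketch follows; the status and link-type strings (`eligible`, `ineligible`, `unsequenced`, `molecular`, `social`) and both function names are illustrative assumptions, since the source does not name the underlying fields.

```javascript
// Hypothetical status values; the real field names are not given in the source.
function nodeStyle(status) {
  switch (status) {
    case "eligible": // sequenced and TNA-eligible
      return { shape: "circle", fill: "solid", border: "solid" };
    case "ineligible": // sequenced, but fails quality or length criteria
      return { shape: "diamond", fill: "solid", border: "solid" };
    case "unsequenced": // no genetic sequence available
      return { shape: "square", fill: "none", border: "dashed" };
    default:
      throw new Error(`Unknown node status: ${status}`);
  }
}

// Molecular distance links are solid; social or epidemiological links are dashed.
function linkStyle(kind) {
  if (kind === "molecular") return "solid";
  if (kind === "social") return "dashed";
  throw new Error(`Unknown link kind: ${kind}`);
}
```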

Technical Barriers

Web Browser Data Capacity Limits

2 Why Browser Software Fails on Big Data

Web browsers cap both the memory a page may use and the size of data it can parse at once, so loading a large statewide or national network can freeze or crash the application.
Storing data as one record per person repeats every field label in every record, inflating memory consumption during visualization.
Processing the entire dataset in a single step monopolizes the browser's processing time and blocks user interaction.

512MB

Web Browser Data Processing Limit

1.5GB

Computer Memory Crash Threshold

Solution Architecture

The Proposed Changes

3 Columnar Storage & Responsive Visualizations

A columnar data layout stores per-person attributes in flat parallel arrays, which eliminates repeated field labels and reduces file size.
Dividing data loading into incremental background stages keeps the user interface responsive and interactive.
Loading detailed per-person attributes on demand reduces active memory consumption during cluster exploration.
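
On-demand loading can be pictured as building a single person's record from the parallel arrays only when that person is selected. The sketch below is an illustration under assumptions: `recordAt` and `sampleColumns` are hypothetical names, and the sample values are borrowed from the columnar example later in this deck.

```javascript
// Parallel arrays: position i in every column belongs to the same person.
const sampleColumns = {
  id: ["case1", "case2"],
  ehars_uid: ["E-102", "E-103"],
  age_dx: ["32", "45"],
  gender: ["man", "woman"],
};

// Build one record object from the columns, only when it is requested.
function recordAt(columns, index) {
  const record = {};
  for (const [field, values] of Object.entries(columns)) {
    record[field] = values[index];
  }
  return record;
}

const selected = recordAt(sampleColumns, 1);
// selected is { id: "case2", ehars_uid: "E-103", age_dx: "45", gender: "woman" }
```

Until a case is selected, only the flat arrays are held in memory; no per-person objects exist.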

1. Compact Data Payload

Disk and network footprint reduced to approximately 30% of original size.

2. Memory Footprint Dropped by 80%

Flat in-memory arrays replace one object per record.

3. Non-Blocking Page Load

Visual layout renders incrementally to keep the software interactive.

Data Structure

How Columnar Compression Works

Standard Row-Based JSON

// Field labels repeated for every case
"Nodes": [
  {
    "id": "case1",
    "ehars_uid": "E-102",
    "age_dx": "32",
    "gender": "man"
  },
  {
    "id": "case2",
    "ehars_uid": "E-103",
    "age_dx": "45",
    "gender": "woman"
  }
]

Redundant field labels inflate file sizes and increase computer memory overhead.

Optimized Columnar JSON

// Labels defined once; values stored as lists
"persons_attribute_schema": {
  "ehars_uid": { "type": "String" },
  "age_dx": { "type": "String" },
  "gender": { "type": "String" }
},
"Nodes": {
  "id": ["case1", "case2"],
  "ehars_uid": ["E-102", "E-103"],
  "age_dx": ["32", "45"],
  "gender": ["man", "woman"]
}

Columnar storage eliminates redundant labels and reduces file size.
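
The conversion from the row-based layout to the columnar one can be sketched in a few lines. This is a minimal illustration, not the project's converter; `toColumnar` is a hypothetical name, and filling missing fields with `null` is an assumption.

```javascript
// Convert an array of row objects into an object of parallel arrays.
function toColumnar(rows) {
  // Collect every field name once, in first-seen order.
  const fields = [...new Set(rows.flatMap((row) => Object.keys(row)))];
  const columns = {};
  for (const field of fields) {
    // Rows that lack a field get null so all columns stay the same length.
    columns[field] = rows.map((row) => (field in row ? row[field] : null));
  }
  return columns;
}

const rowNodes = [
  { id: "case1", ehars_uid: "E-102", age_dx: "32", gender: "man" },
  { id: "case2", ehars_uid: "E-103", age_dx: "45", gender: "woman" },
];
const columnarNodes = toColumnar(rowNodes);

// Serialized size of each layout, in characters.
const rowSize = JSON.stringify(rowNodes).length;
const columnarSize = JSON.stringify(columnarNodes).length;
```

Even with two records the columnar form serializes shorter, and the gap widens with every added record because each label is written only once.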

Advanced Encoding

Specific Column Compression Methods

Encoding Array Types

Run-Length Encoding (RLE) compresses consecutive repeating values such as county codes or stage flags by storing them as value-run pairs.
Front Coding compresses arrays of long strings with common prefixes (such as accessions or node IDs) by storing, for each string, the length of the prefix it shares with the previous string plus its unique suffix.
Delta Encoding compresses sequential numeric values by storing the differences between consecutive numbers.
Sparse Encoding compresses mostly uniform lists by storing a single default value plus key-value exceptions.
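
The deck gives no sample for the sparse layout, so the sketch below assumes one: the keys `len`, `default`, and `exceptions`, and the function name `decodeSparse`, are all hypothetical. It shows only the general idea of a default value overridden at a few positions.

```javascript
// Expand a sparse column: every position holds the default value
// unless the exceptions map names a different value for that index.
function decodeSparse({ len, default: fallback, exceptions }) {
  const out = new Array(len).fill(fallback);
  for (const [index, value] of Object.entries(exceptions)) {
    out[Number(index)] = value;
  }
  return out;
}

const sparseColumn = decodeSparse({
  len: 5,
  default: "No",
  exceptions: { 1: "Yes", 4: "Yes" },
});
// sparseColumn is ["No", "Yes", "No", "No", "Yes"]
```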

Run-Length Encoding (RLE)

"rle": true, "len": 969,
"values": ["Unknown", "Alive"], "runs": [400, 569]

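Decoding the run-length sample above could look like the following sketch. It assumes that `runs[i]` is the number of consecutive copies of `values[i]` and that `len` is the expected decoded length; `decodeRle` is a hypothetical name.

```javascript
// Expand value-run pairs back into a full column.
function decodeRle({ len, values, runs }) {
  const out = [];
  values.forEach((value, i) => {
    for (let k = 0; k < runs[i]; k++) out.push(value);
  });
  // The declared length doubles as a consistency check.
  if (out.length !== len) {
    throw new Error(`RLE length mismatch: expected ${len}, got ${out.length}`);
  }
  return out;
}

const vitalStatus = decodeRle({
  len: 969,
  values: ["Unknown", "Alive"],
  runs: [400, 569],
});
// 400 copies of "Unknown" followed by 569 copies of "Alive"
```
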
Front Coding

"front": true,
"suffixes": ["XS02H", "3491"], "lens": [0, 5]

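The front-coded sample above can be decoded as sketched here, assuming `lens[i]` is the number of leading characters that string `i` shares with the previously decoded string. `decodeFront` is a hypothetical name.

```javascript
// Rebuild each string from the shared prefix of its predecessor plus its suffix.
function decodeFront({ suffixes, lens }) {
  const out = [];
  let previous = "";
  suffixes.forEach((suffix, i) => {
    const current = previous.slice(0, lens[i]) + suffix;
    out.push(current);
    previous = current;
  });
  return out;
}

const nodeIds = decodeFront({ suffixes: ["XS02H", "3491"], lens: [0, 5] });
// nodeIds is ["XS02H", "XS02H3491"]
```
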
Delta Encoding

"delta": true, "values": [172800, 86400, 86400]

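Decoding the delta sample above is a running sum. The sketch assumes the first entry is an absolute value and every later entry is the difference from its predecessor; `decodeDelta` is a hypothetical name.

```javascript
// Turn stored differences back into the original sequence.
function decodeDelta({ values }) {
  const out = [];
  let running = 0;
  for (const step of values) {
    running += step;
    out.push(running);
  }
  return out;
}

const decoded = decodeDelta({ values: [172800, 86400, 86400] });
// decoded is [172800, 259200, 345600]
```
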
JS Performance

Software Rendering Optimizations

4 Non-blocking Rendering

Background task scheduling splits network unpacking and cluster rendering into segments to prevent browser freezes.
Selective property loading defers attribute compilation until a specific cluster or node is selected in the dashboard.
Delegating heavy computations to background tasks maintains full interface responsiveness.
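
One common way to keep a page responsive is to process a large array in slices and yield to the event loop between slices. The sketch below illustrates that pattern under assumptions: `processInChunks` and its parameters are hypothetical names, and the dashboard's actual scheduler is not described in the source.

```javascript
// Process `items` in slices of `chunkSize`, handing control back to the
// event loop between slices. `schedule` is injectable so callers can swap
// in another scheduler; it defaults to a zero-delay timer.
function processInChunks(items, chunkSize, handleItem, onDone,
                         schedule = (fn) => setTimeout(fn, 0)) {
  if (chunkSize < 1) throw new Error("chunkSize must be at least 1");
  let start = 0;
  function runChunk() {
    const end = Math.min(start + chunkSize, items.length);
    for (let i = start; i < end; i++) handleItem(items[i], i);
    start = end;
    if (start < items.length) {
      schedule(runChunk); // more work remains: queue the next slice
    } else if (onDone) {
      onDone(); // everything has been handled
    }
  }
  runChunk();
}
```

In a browser, the `schedule` argument could wrap `requestIdleCallback` instead of a timer, and computation that needs no DOM access could instead move to a Web Worker.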

Real-World Benchmarks

Benchmarking & Network Statistics

Key Metric: Efficient Compression

Columnar compression reduces the annotated network JSON for the XS dataset (9,029 records) from 14.4 MB to 1.09 MB.
For the larger YL dataset (155,796 records, 71,200 nodes), file size falls from 397.73 MB to 6.57 MB.
For the XU dataset (152,893 records, 92,477 nodes), which has the most nodes, file size falls from 286.97 MB to 13.74 MB.
File size falls by more than 90% in every dataset, and the smaller in-browser memory footprint keeps the dashboard responsive across all of them.
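
The reduction implied by each pair of file sizes quoted above can be checked with a short calculation. The sizes come from this slide; the helper name `reductionPercent` is illustrative.

```javascript
// File sizes in MB, before and after columnar compression.
const benchmarks = [
  { dataset: "XS", beforeMB: 14.4, afterMB: 1.09 },
  { dataset: "YL", beforeMB: 397.73, afterMB: 6.57 },
  { dataset: "XU", beforeMB: 286.97, afterMB: 13.74 },
];

// Percentage by which the file shrank.
function reductionPercent(beforeMB, afterMB) {
  return (1 - afterMB / beforeMB) * 100;
}

for (const { dataset, beforeMB, afterMB } of benchmarks) {
  console.log(`${dataset}: ${reductionPercent(beforeMB, afterMB).toFixed(1)}% smaller`);
}
// XS: 92.4% smaller
// YL: 98.3% smaller
// XU: 95.2% smaller
```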

XS Network Statistics

Total Nodes (Cases)	6,949
Total Links (Edges)	15,339
Linked (Clustered) Nodes	4,601
Unlinked (Isolated) Nodes	2,348
MSPP Cases (Multiple Sequences)	1,070 (3,120 nodes)

Capacity Analysis

Record Capacity Estimations

1,000,000

Estimated Persons Records Limit

~320MB

Expected Columnar JSON Size

78%

Gzip Size Reduction

Extrapolated Network Capacity

Standard row-based JSON formats exhaust web browser memory limits at approximately 40,000 cases when clinical metadata is included.
Columnar structures scale efficiently to support network sizes of up to one million cases without exceeding browser memory limits.
Standard gzip network compression reduces a one-million-case dataset to less than 100 MB for efficient web transmission.
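
The transfer-size estimate follows from the two figures on this slide. The sketch assumes that the 78% gzip figure means the compressed file is 78% smaller than the uncompressed one; under the opposite reading (compressed to 78% of the original) the result would exceed 100 MB and contradict the statement above. The function name is illustrative.

```javascript
// Size after gzip, given the uncompressed size and the fractional reduction.
function gzippedSizeMB(sizeMB, reduction) {
  return sizeMB * (1 - reduction);
}

const columnarSizeMB = 320; // expected columnar JSON size for 1,000,000 records
const gzipReduction = 0.78; // assumed meaning: 78% smaller after gzip

console.log(gzippedSizeMB(columnarSizeMB, gzipReduction).toFixed(1)); // 70.4
```

About 70 MB is consistent with the stated bound of less than 100 MB.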

UI/UX Rendering

Visualizing Unsequenced & Unclustered Cases

5 Visual Distinction in Network Layouts

Cases without genetic sequence data (unsequenced nodes) are drawn as hollow squares with dashed borders in cluster views so that critical social links remain visible.
Unclustered cases are rendered with reduced size and opacity to provide epidemiologic context without cluttering the primary cluster.
Selecting any case highlights its social and contact links to unsequenced partners using dashed connection lines.

Data Integration

Importing Social & Contact Networks

6 Integrating Non-Molecular Linkages

Public health investigations combine molecular surveillance data with social and epidemiological contact networks.
Two simple CSV templates allow importing these non-molecular linkages and node-level risk attributes.
The visualization engine overlays social links (dashed lines) on molecular distance links (solid lines) in real time.
This places unsequenced partners directly in the transmission network and helps identify cases that bridge clusters.

Social Edges (Contacts) CSV

Index,Partner,Contact
case1_id,case2_id,Social
case2_id,unseq1_id,Social

Social Attributes CSV

Index,SexWork,SubstanceUse,InternetDating
case1_id,Yes,No,Yes
unseq1_id,No,Yes,No
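
Reading the two templates and merging them with the sequenced cases might look like the sketch below. It is an illustration under assumptions: the parser handles only simple comma-separated values with no quoted fields, and `parseCsv`, `importSocial`, and the returned shape are hypothetical.

```javascript
// Minimal CSV reader: header row, then one object per line.
// Assumes no quoted fields or embedded commas.
function parseCsv(text) {
  const [headerLine, ...lines] = text.trim().split(/\r?\n/);
  const headers = headerLine.split(",");
  return lines.map((line) => {
    const cells = line.split(",");
    return Object.fromEntries(headers.map((h, i) => [h, cells[i]]));
  });
}

// Merge social edges and attributes; any ID that is not a known sequenced
// case is reported as an unsequenced partner.
function importSocial(sequencedIds, edgesCsv, attributesCsv) {
  const known = new Set(sequencedIds);
  const links = parseCsv(edgesCsv).map((row) => ({
    source: row.Index,
    target: row.Partner,
    type: row.Contact,
    line: "dashed", // social links are drawn dashed
  }));
  const attributes = new Map(
    parseCsv(attributesCsv).map(({ Index, ...rest }) => [Index, rest])
  );
  const unsequenced = new Set();
  for (const { source, target } of links) {
    for (const id of [source, target]) {
      if (!known.has(id)) unsequenced.add(id);
    }
  }
  return { links, attributes, unsequenced: [...unsequenced] };
}

const edgesCsv = [
  "Index,Partner,Contact",
  "case1_id,case2_id,Social",
  "case2_id,unseq1_id,Social",
].join("\n");
const attributesCsv = [
  "Index,SexWork,SubstanceUse,InternetDating",
  "case1_id,Yes,No,Yes",
  "unseq1_id,No,Yes,No",
].join("\n");

const social = importSocial(["case1_id", "case2_id"], edgesCsv, attributesCsv);
// social.unsequenced is ["unseq1_id"]
```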

Subnetwork Operations

Adding Unsequenced Nodes to COI

7 Identifying Bridging Cases and Outbreaks

Context-sensitive menus allow investigators to assign unsequenced cases directly to new or existing clusters of interest.
When an unsequenced case links two molecular clusters epidemiologically, the system supports merging them into a unified outbreak cluster of interest.
Outbreak tracking tables count both sequenced and unsequenced cases to display the true scale of the transmission network (e.g., 34 sequenced plus 10 unsequenced cases).

Cluster Table View:

34 + 10 (unseq)

Combined Case Count Column
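
The combined count and the merge of two clusters through a bridging case can be sketched as follows. The member shape (`id`, `sequenced`) and both function names are assumptions made for illustration.

```javascript
// Format a cluster's size as "sequenced + unsequenced (unseq)".
function caseCountLabel(members) {
  const unseq = members.filter((m) => !m.sequenced).length;
  const seq = members.length - unseq;
  return unseq > 0 ? `${seq} + ${unseq} (unseq)` : `${seq}`;
}

// Merge two clusters that share a bridging case, keeping each ID once.
function mergeClusters(clusterA, clusterB, bridge) {
  const byId = new Map();
  for (const member of [...clusterA, ...clusterB, bridge]) {
    byId.set(member.id, member);
  }
  return [...byId.values()];
}

// Helper for the example: n members with a given sequencing status.
const makeMembers = (prefix, n, sequenced) =>
  Array.from({ length: n }, (_, i) => ({ id: `${prefix}${i}`, sequenced }));

const outbreak = [...makeMembers("s", 34, true), ...makeMembers("u", 10, false)];
console.log(caseCountLabel(outbreak)); // 34 + 10 (unseq)
```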

User Interface

Querying & Searching Unsequenced Cases

8 Outbreak Filters and Multi-Cluster Identification

Individuals linked to multiple clusters are represented with multi-cluster identifier values to enable rapid search and cross-cluster investigation.
Node search tables support sorting and filtering by sequence status to isolate unsequenced or poor-quality data.
The query builder allows public health officials to filter cases by clinical attributes, cluster boundaries, or sequence availability.

QueryBuilder Config Example

// Match all multi-cluster individuals
{
  "condition": "AND",
  "rules": [
    {
      "field": "CLUSTER_ID",
      "operator": "contains",
      "value": ";"
    }
  ]
}
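
To show what the rule above selects, here is a small stand-alone evaluator for rule groups of this shape. It is not the API of any query-builder library; `matchesGroup`, the operator table, and the sample records are hypothetical. It also assumes, as the rule implies, that a multi-cluster individual's `CLUSTER_ID` holds several IDs separated by semicolons.

```javascript
// Supported operators for this sketch only.
const ruleOperators = {
  contains: (fieldValue, ruleValue) => String(fieldValue ?? "").includes(ruleValue),
  equal: (fieldValue, ruleValue) => fieldValue === ruleValue,
};

// Evaluate a { condition, rules } group against one record.
// Nested groups are evaluated recursively.
function matchesGroup(record, group) {
  const results = group.rules.map((rule) => {
    if (rule.rules) return matchesGroup(record, rule);
    const op = ruleOperators[rule.operator];
    if (!op) throw new Error(`Unsupported operator: ${rule.operator}`);
    return op(record[rule.field], rule.value);
  });
  return group.condition === "AND"
    ? results.every(Boolean)
    : results.some(Boolean);
}

const multiClusterQuery = {
  condition: "AND",
  rules: [{ field: "CLUSTER_ID", operator: "contains", value: ";" }],
};

const people = [
  { id: "case1", CLUSTER_ID: "C12" },
  { id: "case2", CLUSTER_ID: "C12;C40" },
];
const multiCluster = people.filter((p) => matchesGroup(p, multiClusterQuery));
// multiCluster contains only case2
```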

Data Quality Control

Sequence Quality & TNA Eligibility

9 Quality Control and Eligibility Verification

Sequence quality and treatment-naive eligibility metrics are calculated automatically during the quality control stage of the pipeline.
The network annotation script incorporates these quality metrics directly into the visualization schema.
Good-quality, poor-quality, and unsequenced cases are labeled distinctly to help prioritize cases for molecular surveillance.

Persons Attribute Schema Mapping

// Attribute schema defined in JSON
"persons_attribute_schema": {
  "poor_quality": { "type": "String" },
  "fraction_ambig": { "type": "Number" },
  "TNA_eligible": { "type": "String" }
},
// Parallel arrays in persons_attributes
"persons_attributes": {
  "poor_quality": ["No", "Yes", null],
  "fraction_ambig": [0.0015, 0.052, null],
  "TNA_eligible": ["Yes", "No", null]
}
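
The three labels can be derived from these parallel arrays as sketched below. The sketch assumes that a `null` entry in `poor_quality` marks a case with no sequence, which matches the third entry in the sample; `sequenceStatus` is a hypothetical name.

```javascript
const personsAttributes = {
  poor_quality: ["No", "Yes", null],
  fraction_ambig: [0.0015, 0.052, null],
  TNA_eligible: ["Yes", "No", null],
};

// Classify the case at position i from the quality column.
function sequenceStatus(attrs, i) {
  const flag = attrs.poor_quality[i];
  if (flag === null) return "unsequenced"; // assumption: null means no sequence
  return flag === "Yes" ? "poor quality" : "good quality";
}

const labels = personsAttributes.poor_quality.map((_, i) =>
  sequenceStatus(personsAttributes, i)
);
// labels is ["good quality", "poor quality", "unsequenced"]
```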

Enables granular searching, filtering, and case color-coding in the visualization dashboard.

Codebase Integrity

System Architecture Updates

Component	Updates	Public Health Impact
Web Visualization Dashboard	Optimized memory allocation and non-blocking loading stages.	Supports seamless interactive visualization of 100k+ cases without application freezes or crashes.
Network Annotation Engine	Integrated sequence quality metrics and unsequenced cases into the network schema.	Enables full display of social links, data quality flags, and unsequenced cases in outbreak tables.
Data Quality Control Pipeline	Integrated sequence quality reports directly into the network generation pipeline.	Automates identification of poor quality and treatment-naive cases for quick prioritization.