File Sync
Anton Nesterov edited this page 2026-02-26 16:23:25 +01:00

File Sync for Replicas

When using local storage (not S3), replicas need a way to sync uploaded files from the master server. VSKI provides automatic file synchronization as part of the replication system.

Overview

┌─────────────────────────────────────────────────────────────┐
│                         Master                               │
│  ┌─────────────┐    ┌─────────────────┐    ┌─────────────┐  │
│  │  Database   │    │  _file_journal  │    │   Storage   │  │
│  │  default.db │◄───│  id | op | path │◄───│   /files    │  │
│  └──────┬──────┘    └────────┬────────┘    └──────┬──────┘  │
└─────────┼────────────────────┼────────────────────┼─────────┘
          │                    │                    │
          │ DB Sync            │ Journal Sync       │ File Download
          │                    │                    │
          ▼                    ▼                    ▼
┌─────────────────────────────────────────────────────────────┐
│                         Replica                              │
│  ┌─────────────┐                              ┌───────────┐  │
│  │  Database   │                              │  Storage  │  │
│  │  default.db │                              │  /files   │  │
│  └─────────────┘                              └───────────┘  │
└─────────────────────────────────────────────────────────────┘

Key Components

Component           Description
_file_journal       Table tracking file operations (add/delete)
File Journal Sync   Replica fetches journal entries after DB sync
File Download       Replica downloads new files from master
File Deletion       Replica removes local files marked as deleted
Acknowledgment      Replica confirms sync, master cleans up journal

How It Works

1. Recording File Operations

When a file is uploaded or deleted on the master:

-- On file upload
INSERT INTO _file_journal (operation, path, created) 
VALUES ('add', 'collection_id/record_id/document.pdf', datetime('now'));

-- On file deletion  
INSERT INTO _file_journal (operation, path, created)
VALUES ('delete', 'collection_id/record_id/document.pdf', datetime('now'));

The journal is stored in default.db and replicates automatically with the database.

2. Sync Process

Every sync cycle, the replica:

  1. Syncs database - Gets the latest _file_journal entries
  2. Fetches journal - Requests entries since last sync
  3. Downloads files - For each add operation, downloads file from master
  4. Deletes files - For each delete operation, removes local file
  5. Acknowledges - Tells master which entries were processed
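
The five steps above can be sketched as one loop with injected I/O. The endpoint references in the comments match the internal replica API documented later on this page; everything else (the `SyncIO` interface, `syncFiles`, and the field names) is an illustrative sketch, not VSKI's actual code.

```typescript
// One file sync cycle, run after the database sync.
type JournalEntry = { id: number; operation: "add" | "delete"; path: string };

interface SyncIO {
  fetchJournal(sinceId: number): Promise<JournalEntry[]>; // GET /api/replica/files?since=
  downloadFile(path: string): Promise<void>;              // GET /api/replica/file/*path
  deleteLocal(path: string): Promise<void>;
  acknowledge(ids: number[]): Promise<void>;              // POST /api/replica/files/ack
}

// Returns the journal entry ids that were acknowledged this cycle.
async function syncFiles(io: SyncIO, lastSyncedId: number): Promise<number[]> {
  const entries = await io.fetchJournal(lastSyncedId);
  const acked: number[] = [];
  for (const entry of entries) {
    try {
      if (entry.operation === "add") {
        await io.downloadFile(entry.path);
      } else {
        await io.deleteLocal(entry.path);
      }
      acked.push(entry.id);
    } catch (err) {
      // A failed entry is not acknowledged: it stays in the journal and is
      // retried on the next cycle, while the remaining entries continue.
      console.error(`file sync failed for ${entry.path}:`, err);
    }
  }
  if (acked.length > 0) {
    await io.acknowledge(acked);
  }
  return acked;
}
```

Only successfully processed entries are acknowledged, which is what makes the retry behavior for failed downloads fall out naturally.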

3. Journal Cleanup

The master uses a dual cleanup strategy:

Strategy     Trigger                      Purpose
Ack-based    Replica confirms sync        Immediate cleanup after sync
Time-based   Entries older than N days    Safety net for failed syncs
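
The dual strategy can be pictured as a pure decision function. This sketch assumes the master keeps the highest acknowledged journal id per replica — that bookkeeping detail is an assumption, not documented behavior — and all names are illustrative:

```typescript
// Decide which journal entries are safe to delete under the dual strategy.
type CleanupEntry = { id: number; created: Date };

function cleanableEntries(
  entries: CleanupEntry[],
  ackedIdPerReplica: Map<string, number>, // replica name -> highest acked id (assumed bookkeeping)
  retentionDays: number,                  // FILE_SYNC_RETENTION_DAYS
  now: Date,
): CleanupEntry[] {
  // Ack-based: an entry is safe once every known replica has acknowledged it.
  const ackedIds = Array.from(ackedIdPerReplica.values());
  const minAcked = ackedIds.length > 0 ? Math.min(...ackedIds) : -Infinity;
  // Time-based safety net for offline replicas and failed acknowledgments.
  const cutoff = new Date(now.getTime() - retentionDays * 24 * 60 * 60 * 1000);
  return entries.filter((e) => e.id <= minAcked || e.created < cutoff);
}
```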

Configuration

Master Server

The file journal is created automatically when:

  • Not in replica mode
  • Not using S3 storage
# Master configuration
JWT_SECRET=your-shared-secret
STORAGE_PATH=./data/files          # Local storage path
FILE_SYNC_RETENTION_DAYS=7         # Time-based cleanup (default: 7)

Replica Server

File sync is enabled automatically when:

  • Running in replica mode
  • Not using S3 storage
# Replica configuration
REPLICA_MODE=replica
MASTER_URL=http://master:3001
JWT_SECRET=your-shared-secret      # Must match master
STORAGE_PATH=./data/files          # Local storage path
SYNC_INTERVAL=60                   # Sync every 60 seconds

Environment Variables

Variable                   Default        Description
FILE_SYNC_RETENTION_DAYS   7              Days to retain journal entries
STORAGE_PATH               ./data/files   Path for file storage
SYNC_INTERVAL              60             Sync interval in seconds

When File Sync is Disabled

File sync is automatically skipped when using S3 storage:

# With S3, files are stored in the bucket
# No file journal is created
# Replicas access the same S3 bucket
S3_ENDPOINT=https://s3.amazonaws.com
S3_BUCKET=my-bucket
S3_ACCESS_KEY=xxx
S3_SECRET_KEY=xxx

Gotchas & Edge Cases

File Path Format

Files are stored with the path structure:

{storage_path}/{collection_id}/{record_id}/{filename}

Examples:

  • files/abc123/def456/document.pdf
  • files/users/user789/avatar.png
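
The layout can be sketched as a small helper. The function name and the traversal guard are illustrative hardening choices, not necessarily what VSKI does internally:

```typescript
import { posix } from "node:path";

// Build the storage path for an uploaded file following the layout above.
function filePath(
  storagePath: string,
  collectionId: string,
  recordId: string,
  filename: string,
): string {
  // Reject separators and parent references so a crafted filename cannot
  // escape the record's directory (illustrative guard, an assumption).
  for (const segment of [collectionId, recordId, filename]) {
    if (segment.includes("/") || segment.includes("..")) {
      throw new Error(`invalid path segment: ${segment}`);
    }
  }
  return posix.join(storagePath, collectionId, recordId, filename);
}
```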

Record Deletion

When a record is deleted, delete entries for all of its associated files are recorded in the journal:

// Deleting a record triggers file cleanup
await client.collection("posts").delete("record_id");

// The journal records all file deletions
// Replica will delete all files for this record

Bulk Operations

Bulk deletes also trigger file journal entries:

// Bulk delete - files for all records are journaled
await client.collection("posts").bulkDelete(["id1", "id2", "id3"]);

Sync Timing

File sync happens after database sync in each cycle:

  1. Database sync completes
  2. File journal entries fetched
  3. Files downloaded/deleted
  4. Acknowledgment sent

If file sync fails, the journal entries remain for the next cycle.

Network Interruptions

If file download fails:

  • Journal entry is NOT acknowledged
  • File will be retried on next sync
  • Other files continue syncing
  • Error is logged but doesn't stop sync

Large Files

Large files are downloaded with:

  • 5-minute HTTP timeout
  • SHA256 checksum verification
  • Atomic file writes (temp file → final location)
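
The checksum-then-rename pattern can be sketched as follows; `sha256Hex` and `atomicWrite` are illustrative names, not VSKI's real functions:

```typescript
import { createHash } from "node:crypto";
import { mkdirSync, mkdtempSync, readFileSync, renameSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { dirname, join } from "node:path";

function sha256Hex(data: Buffer): string {
  return createHash("sha256").update(data).digest("hex");
}

// Verify the checksum, write to a temp file next to the target, then rename.
// rename within the same filesystem is atomic, so readers never see a
// partially written file.
function atomicWrite(finalPath: string, data: Buffer, expectedSha256: string): void {
  if (sha256Hex(data) !== expectedSha256) {
    throw new Error(`checksum mismatch for ${finalPath}`);
  }
  mkdirSync(dirname(finalPath), { recursive: true });
  const tmpPath = finalPath + ".tmp";
  writeFileSync(tmpPath, data);
  renameSync(tmpPath, finalPath);
}

// Demo: round-trip a small payload through the atomic path.
const dir = mkdtempSync(join(tmpdir(), "vski-sync-"));
const payload = Buffer.from("hello");
atomicWrite(join(dir, "col1", "rec1", "doc.pdf"), payload, sha256Hex(payload));
console.log(readFileSync(join(dir, "col1", "rec1", "doc.pdf")).toString()); // prints "hello"
```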

Storage Consistency

The replica's storage matches the master's structure:

Master Storage                 Replica Storage
./data/files/                  ./data/files/
├── col1/                      ├── col1/
│   └── rec1/                  │   └── rec1/
│       └── file.pdf           │       └── file.pdf
└── col2/                      └── col2/
    └── rec2/                      └── rec2/
        └── image.png                  └── image.png

Monitoring

Check File Journal Status

# On master, query the journal
sqlite3 data/db/default.db "SELECT COUNT(*) FROM _file_journal"

# View recent entries
sqlite3 data/db/default.db \
  "SELECT * FROM _file_journal ORDER BY id DESC LIMIT 10"

Sync Logs

Watch for file sync activity:

# Replica logs
INFO starting replication sync
INFO replication sync completed
INFO syncing files count=5
DEBUG downloaded file path=col1/rec1/doc.pdf
DEBUG deleted local file path=col2/rec2/old.png

Troubleshooting

Files Not Syncing to Replica

  1. Check S3 configuration - File sync is skipped with S3
  2. Verify STORAGE_PATH - Ensure replica has write access
  3. Check sync interval - Files sync after DB sync
  4. Review journal - Check for entries in _file_journal

File Journal Growing Large

  1. Check replica connectivity - Ack-based cleanup depends on replicas
  2. Verify retention setting - FILE_SYNC_RETENTION_DAYS
  3. Manual cleanup - Run CleanupOlderThan manually
-- Manual cleanup (entries older than 7 days)
DELETE FROM _file_journal 
WHERE created < datetime('now', '-7 days');

Checksum Mismatch

If file checksum fails:

  1. File is NOT saved to replica
  2. Journal entry remains unacknowledged
  3. Retry happens on next sync
  4. Check master file integrity

Replica Has Extra Files

Files deleted on master will be deleted on replica during sync. If replica has extra files:

  1. They were created before replication was set up
  2. They were created directly on replica (shouldn't happen - read-only)
  3. Manual cleanup may be needed

Best Practices

1. Configure Adequate Retention

# For frequent sync (every minute)
FILE_SYNC_RETENTION_DAYS=1   # 1 day is sufficient

# For infrequent sync (hourly or more)
FILE_SYNC_RETENTION_DAYS=7   # More buffer for issues

2. Monitor Disk Space

Both master and replica need space for:

  • Database files
  • Uploaded files
  • Temporary download files

3. Use Consistent Storage Paths

# Same path structure on both servers
# Master
STORAGE_PATH=/var/lib/vski/files

# Replica
STORAGE_PATH=/var/lib/vski/files

4. Test File Sync

# Upload a file on master
curl -X POST http://master:3001/api/collections/docs/records \
  -H "Authorization: Bearer $TOKEN" \
  -F "file=@test.pdf"

# Wait for sync interval
sleep 60

# Verify on replica
curl http://replica:3002/files/docs/record_id/test.pdf

5. Plan for Large Files

For systems with large files:

  • Increase SYNC_INTERVAL to reduce bandwidth
  • Monitor sync duration
  • Consider S3 for better scalability

API Endpoints

File sync uses these internal endpoints, served by the master:

Endpoint                         Description
GET /api/replica/files?since=    Get journal entries since ID
GET /api/replica/file/*path      Download file from master
POST /api/replica/files/ack      Acknowledge synced entries

These are internal endpoints used by the sync process and require replica authentication.
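
As a sketch, the requests a replica issues might be built like this. The URL shapes and the `?since=` query come from the endpoint table above; the `ids` field name in the acknowledgment body is an assumption:

```typescript
// Request builders for the three internal endpoints (shapes partly assumed).
function journalRequest(masterUrl: string, sinceId: number): string {
  return `${masterUrl}/api/replica/files?since=${sinceId}`;
}

function fileRequest(masterUrl: string, path: string): string {
  return `${masterUrl}/api/replica/file/${path}`;
}

function ackRequest(masterUrl: string, ids: number[]): { url: string; body: string } {
  // The JSON body field name is an assumption, not documented.
  return { url: `${masterUrl}/api/replica/files/ack`, body: JSON.stringify({ ids }) };
}
```

Each request would additionally carry the shared-secret replica authentication mentioned below.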

Architecture Decisions

Why Store Journal in Database?

  1. Atomic with DB sync - Journal replicates with the database
  2. No separate state - Single source of truth
  3. Automatic consistency - DB transaction guarantees
  4. Simple recovery - Restore DB, get journal state

Why Dual Cleanup Strategy?

  1. Ack-based - Immediate cleanup when replicas confirm
  2. Time-based - Safety net for:
    • Offline replicas
    • Failed acknowledgments
    • New replicas (don't need old entries)

Why Not Real-time File Sync?

  1. Batching - More efficient for multiple files
  2. Consistency - Files sync after DB is consistent
  3. Simplicity - No separate file sync connection
  4. Reliability - Retry on next cycle if failed