Agent Skill
2/7/2026

db23-pipeline

Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.

T
tankygranny05
0GitHub Stars
1Views
npx skills add tankygranny05/agent-box

SKILL.md

Namedb23-pipeline
DescriptionRun the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.

name: db23-pipeline description: Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.

DB23 6-Step Pipeline

[Created by Sonnet: ea2be943-c1f3-4d1e-b9ae-c326b0eb08d7] [Edited by Opus: 8cdaee7c-f8aa-4e2b-b055-18cff1267dba]

Run the 6-step data ingestion pipeline for Vietnamese stock market trading data.

Critical Requirements

  • PYTHONHASHSEED=0 is mandatory for ALL steps
  • Sequential execution - run steps 1→6 in order, one day at a time
  • Single-day files only - use db23-excel-split skill first for multi-day files
  • Step 5 --day flag - MUST specify --day when re-processing existing days
  • BACKUP FIRST - Always backup existing data before running pipeline (see below)

Data Backup Warning

CRITICAL: The pipeline OVERWRITES existing data!

When you run the pipeline for a day that already has data in the output folders, the existing files WILL BE OVERWRITTEN without warning:

StepOutput LocationOverwritten?
Step 1/db23/parsed_excel_files/{day}.pickleYES
Step 2/db23/df_new_day_numerical/{day}.pickleYES
Step 2/db23/df_new_day_str/{day}.pickleYES

Before running the pipeline, ALWAYS backup existing data:

# Create backup directory with your agent suffix
backup_dir="/tmp/agent-{suffix}-pickle-backup"
mkdir -p "$backup_dir/parsed_excel_files"
mkdir -p "$backup_dir/df_new_day_numerical"
mkdir -p "$backup_dir/df_new_day_str"

# Backup existing files for the days you're processing
day="2025_12_16"
cp /db23/parsed_excel_files/${day}.pickle "$backup_dir/parsed_excel_files/" 2>/dev/null
cp /db23/df_new_day_numerical/${day}.pickle "$backup_dir/df_new_day_numerical/" 2>/dev/null
cp /db23/df_new_day_str/${day}.pickle "$backup_dir/df_new_day_str/" 2>/dev/null

When to backup:

  • Processing days that already exist in the system
  • Re-running pipeline after an error
  • Any time you're not 100% certain the new data is correct and complete

Why this matters:

  • Old data may contain information not present in new data (see Incident 2026_01_06)
  • Without backup, you cannot recover from bad pipeline runs
  • Backups enable data merging when old + new are complementary

Agent Instructions for Multi-Day Processing

CRITICAL: When processing multiple days, agents MUST follow these rules:

  1. Run each day independently - Execute the full 6-step pipeline for one day, then move to the next. NEVER write a bash script or loop to process multiple days at once.

  2. Continue on error - If a day fails (any step), do NOT stop processing remaining days. Instead:

    • Use say "Pipeline error on YYYY MM DD" to alert the user
    • Log which step failed
    • Proceed to the next day immediately
  3. Report at the end - After all days are attempted, summarize:

    • Days that succeeded
    • Days that failed (and which step)

Example agent behavior for 3 days:

# Day 1: 2025_12_17
[Run steps 1-6 for 2025_12_17]
✓ Success → say "Day 2025 12 17 pipeline complete"

# Day 2: 2025_12_18
[Run steps 1-6 for 2025_12_18]
✗ Step 3 fails → say "Pipeline error on 2025 12 18"
→ Continue to next day (don't stop!)

# Day 3: 2025_12_19
[Run steps 1-6 for 2025_12_19]
✓ Success → say "Day 2025 12 19 pipeline complete"

# Final summary
Succeeded: 2025_12_17, 2025_12_19
Failed: 2025_12_18 (Step 3)

Why no multi-day scripts?

  • Each day's pickle depends on prior state
  • Errors are easier to diagnose day-by-day
  • User can intervene between days if needed

Pipeline Overview

StepScriptWhat it does
1parse_excel_file.pyParse Excel → /db23/parsed_excel_files/{day}.pickle
2label_parsed_excel_file_new.pyLabel, convert string→numerical, update s2i/i2s
3push_day_to_df_trans.pyAppend day to master df_trans_numerical.pickle
4push_s2i_and_latest_day_to_remote.pyUpload pickles to remote server (m3u)
5remote_compute_df_trans.pyTrigger remote merge via merge_to_df_trans.py
6get_latest_day_from_m3u_then_push_to_kame.pySync results to kame

Quick Start - Full Pipeline

day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/parse_excel_file.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Quick Start - Steps 2-6 Only (Crafted Pickles)

For days with crafted pickles (no Excel file), skip Step 1:

day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Individual Steps

Step 1: Parse Excel

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/parse_excel_file.py --day YYYY_MM_DD

Input: /home/ubuntu/Downloads/5 DDMM.xlsx Output: /db23/parsed_excel_files/YYYY_MM_DD.pickle Time: ~5-6 seconds

Skip if: Using crafted pickle (Step 1 would overwrite it!)

Step 2: Label and Convert

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/label_parsed_excel_file_new.py --day YYYY_MM_DD

Input: /db23/parsed_excel_files/YYYY_MM_DD.pickle Output: /db23/df_new_day_numerical/YYYY_MM_DD.pickle, /db23/df_new_day_str/YYYY_MM_DD.pickle Time: ~45-50 seconds

NaN check output (expected):

day 0 nan count
stock 0 nan count
stk 0 nan count
name 0 nan count
id 0 nan count
address 0 nan count
group 179134 nan count   <- EXPECTED (filled later)

Step 3: Push to df_trans

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/push_day_to_df_trans.py --day YYYY_MM_DD

Input: /db23/df_new_day_numerical/YYYY_MM_DD.pickle Output: Updated /db23/pickles/df_trans_numerical.pickle Time: ~15-20 seconds

Success indicator: SUCCEEDED! (/db23/pickles/df_trans_numerical.pickle)

Step 4: Upload to Remote

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/app/push_s2i_and_latest_day_to_remote.py --day YYYY_MM_DD

Uploads to: m3u (sotola@m3u:22222) Time: ~30-60 seconds (network dependent)

Step 5: Remote Compute + Verification

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/remote_compute_df_trans.py --day YYYY_MM_DD

Remote script: merge_to_df_trans.py (replaces old known_good_version.py) Time: ~40-70 seconds (+ ~10s for verification)

CRITICAL: The --day flag is REQUIRED when re-processing existing days. Without it, the remote script auto-detects the "latest" day by filename, which may not be the day you're processing.

  • New day (not in remote df_trans): --day optional
  • Existing day (re-processing): --day REQUIRED

Integrated Verification: Step 5 now automatically verifies after merge:

  1. Local row count - reads /db23/pickles/df_trans_numerical.pickle
  2. Remote row count - reads remote df_trans via SSH
  3. NaN check - ensures no NaN in stk column (both local & remote)
  4. Match check - local and remote row counts must be equal

Verification Output:

=== VERIFICATION: 2025_12_17 ===

=== LOCAL VERIFICATION ===
Local row count for 2025_12_17: 50,149
✓ Local stk column: No NaN values

=== REMOTE VERIFICATION ===
Remote row count for 2025_12_17: 50,149
✓ Remote stk column: No NaN values

=== VERIFICATION SUMMARY ===
✓ Local: 50,149 rows, no NaN
✓ Remote: 50,149 rows, no NaN
✓ Row counts MATCH: 50,149

=== ALL VERIFICATIONS PASSED ===

If verification fails:

  • Local NaN in stk: Steps 1-4 were compromised - re-run from Step 1
  • Remote NaN in stk: Remote merge issue - re-run Step 5
  • Row count mismatch: Remote didn't process the uploaded day - check --day flag

Step 6: Sync to Kame

PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
  python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py

Syncs to: kame (kame@kamegoku:22)

Expected Timeouts

StepNormalRecommended Timeout
Step 16s10s
Step 250s70s
Step 320s30s
Step 460s90s
Step 570s100s
Step 630s60s
Full pipeline~4 min~6 min

Data Locations

PathContents
/home/ubuntu/Downloads/Input Excel files (5 DDMM.xlsx)
/db23/parsed_excel_files/Per-day parsed pickles
/db23/df_new_day_numerical/Daily numerical dataframes
/db23/df_new_day_str/Daily string dataframes
/db23/pickles/Master pickles (df_trans, s2i/i2s)

Common Mistakes

MistakeConsequencePrevention
Running Step 1 on crafted dayOverwrites crafted pickle with empty/wrong dataSkip Step 1 for crafted days
Missing PYTHONHASHSEED=0KeyError with large negative numberAlways include in every command
Running steps in parallelData corruptionRun sequentially
Processing multiple days at oncePickle conflictsOne day at a time
Step 5 without --day for existing dayRemote processes wrong day (latest by filename)Always pass --day for historical days

Communication Protocol

# On success
say "Day YYYY MM DD pipeline complete"

# On failure
say "Pipeline error on YYYY MM DD"

Remote Servers

ServerSSHPurpose
m3usotola@m3u:22222Primary compute
kamekame@kamegoku:22Backup/sync

Script Locations

ScriptLocalRemote
merge_to_df_trans.py/Users/sotola/PycharmProjects/remote_to_m3_ultra/workflows/Same path
known_good_version.pySame directoryDeprecated - ignored --day flag

Incident Handling

This section documents resolved incidents and their solutions. When encountering similar issues, check here first.

Incident 2026_01_06: Complementary Data Overwrite (Day 2025_12_16)

Problem:

  • Old data for 2025_12_16 had 102,122 rows of regular orders only (is_pt=0)
  • New Excel file for 2025_12_16 had 51 rows of PT transactions only (Sheet1 empty, Sheet2 had 51 PT)
  • Running the pipeline overwrote old data with new, losing 102,122 regular orders

Symptoms:

  • After pipeline: Dec 16 had only 102 rows (51 PT × 2 for buyer+seller)
  • Before pipeline: Dec 16 had 102,122 rows (all non-PT)
  • Data was complementary, not replacement

Root Cause:

  • Multi-day Excel file was split, but the split only contained PT data for Dec 16
  • Step 1 overwrote the existing parsed pickle with PT-only data
  • No backup was made before running

Resolution:

  1. Restore old Step 1 output from backup: /tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle
  2. Concatenate old (102,122 non-PT) + new (102 PT) with matching columns:
    import pandas as pd
    df_old = pd.read_pickle('/tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle')
    df_new = pd.read_pickle('/db23/parsed_excel_files/2025_12_16.pickle')
    # Reorder columns to match, cast price dtype
    df_new = df_new[df_old.columns]
    df_new['price'] = df_new['price'].astype(df_old['price'].dtype)
    df_combined = pd.concat([df_old, df_new], ignore_index=True)
    df_combined.to_pickle('/db23/parsed_excel_files/2025_12_16.pickle')
    
  3. Re-run Steps 2-6 for the day
  4. Verify: 102,224 rows (102,122 non-PT + 102 PT)

Prevention:

  • Always backup before running pipeline on existing days
  • Check new data source for completeness before processing
  • Compare row counts: if new << old, investigate before proceeding

Key Lesson: Old and new data can be complementary (each has data the other doesn't), not just replacement (new supersedes old). When in doubt, backup and compare before overwriting.

Skills Info
Original Name:db23-pipelineAuthor:tankygranny05