db23-pipeline
Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.
SKILL.md
| Name | db23-pipeline |
| Description | Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync. |
name: db23-pipeline description: Run the 6-step db23 trading data ingestion pipeline. Use when processing single-day Excel files through the pipeline, when user mentions "run pipeline", "6-step", "push day", "label day", or needs to ingest a new trading day into df_trans. Covers all steps from parsing Excel to remote sync.
DB23 6-Step Pipeline
[Created by Sonnet: ea2be943-c1f3-4d1e-b9ae-c326b0eb08d7] [Edited by Opus: 8cdaee7c-f8aa-4e2b-b055-18cff1267dba]
Run the 6-step data ingestion pipeline for Vietnamese stock market trading data.
Critical Requirements
- PYTHONHASHSEED=0 is mandatory for ALL steps
- Sequential execution - run steps 1→6 in order, one day at a time
- Single-day files only - use
db23-excel-splitskill first for multi-day files - Step 5 --day flag - MUST specify
--daywhen re-processing existing days - BACKUP FIRST - Always backup existing data before running pipeline (see below)
Data Backup Warning
CRITICAL: The pipeline OVERWRITES existing data!
When you run the pipeline for a day that already has data in the output folders, the existing files WILL BE OVERWRITTEN without warning:
| Step | Output Location | Overwritten? |
|---|---|---|
| Step 1 | /db23/parsed_excel_files/{day}.pickle | YES |
| Step 2 | /db23/df_new_day_numerical/{day}.pickle | YES |
| Step 2 | /db23/df_new_day_str/{day}.pickle | YES |
Before running the pipeline, ALWAYS backup existing data:
# Create backup directory with your agent suffix
backup_dir="/tmp/agent-{suffix}-pickle-backup"
mkdir -p "$backup_dir/parsed_excel_files"
mkdir -p "$backup_dir/df_new_day_numerical"
mkdir -p "$backup_dir/df_new_day_str"
# Backup existing files for the days you're processing
day="2025_12_16"
cp /db23/parsed_excel_files/${day}.pickle "$backup_dir/parsed_excel_files/" 2>/dev/null
cp /db23/df_new_day_numerical/${day}.pickle "$backup_dir/df_new_day_numerical/" 2>/dev/null
cp /db23/df_new_day_str/${day}.pickle "$backup_dir/df_new_day_str/" 2>/dev/null
When to backup:
- Processing days that already exist in the system
- Re-running pipeline after an error
- Any time you're not 100% certain the new data is correct and complete
Why this matters:
- Old data may contain information not present in new data (see Incident 2026_01_06)
- Without backup, you cannot recover from bad pipeline runs
- Backups enable data merging when old + new are complementary
Agent Instructions for Multi-Day Processing
CRITICAL: When processing multiple days, agents MUST follow these rules:
-
Run each day independently - Execute the full 6-step pipeline for one day, then move to the next. NEVER write a bash script or loop to process multiple days at once.
-
Continue on error - If a day fails (any step), do NOT stop processing remaining days. Instead:
- Use
say "Pipeline error on YYYY MM DD"to alert the user - Log which step failed
- Proceed to the next day immediately
- Use
-
Report at the end - After all days are attempted, summarize:
- Days that succeeded
- Days that failed (and which step)
Example agent behavior for 3 days:
# Day 1: 2025_12_17
[Run steps 1-6 for 2025_12_17]
✓ Success → say "Day 2025 12 17 pipeline complete"
# Day 2: 2025_12_18
[Run steps 1-6 for 2025_12_18]
✗ Step 3 fails → say "Pipeline error on 2025 12 18"
→ Continue to next day (don't stop!)
# Day 3: 2025_12_19
[Run steps 1-6 for 2025_12_19]
✓ Success → say "Day 2025 12 19 pipeline complete"
# Final summary
Succeeded: 2025_12_17, 2025_12_19
Failed: 2025_12_18 (Step 3)
Why no multi-day scripts?
- Each day's pickle depends on prior state
- Errors are easier to diagnose day-by-day
- User can intervene between days if needed
Pipeline Overview
| Step | Script | What it does |
|---|---|---|
| 1 | parse_excel_file.py | Parse Excel → /db23/parsed_excel_files/{day}.pickle |
| 2 | label_parsed_excel_file_new.py | Label, convert string→numerical, update s2i/i2s |
| 3 | push_day_to_df_trans.py | Append day to master df_trans_numerical.pickle |
| 4 | push_s2i_and_latest_day_to_remote.py | Upload pickles to remote server (m3u) |
| 5 | remote_compute_df_trans.py | Trigger remote merge via merge_to_df_trans.py |
| 6 | get_latest_day_from_m3u_then_push_to_kame.py | Sync results to kame |
Quick Start - Full Pipeline
day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python gen_data/parse_excel_file.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py
Quick Start - Steps 2-6 Only (Crafted Pickles)
For days with crafted pickles (no Excel file), skip Step 1:
day='YYYY_MM_DD'
cd /Users/sotola/PycharmProjects/mac_local_m4 && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python app/label_parsed_excel_file_new.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python gen_data/push_day_to_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python app/push_s2i_and_latest_day_to_remote.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python automated_pipelines/remote_compute_df_trans.py --day $day && \
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py
Individual Steps
Step 1: Parse Excel
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/parse_excel_file.py --day YYYY_MM_DD
Input: /home/ubuntu/Downloads/5 DDMM.xlsx
Output: /db23/parsed_excel_files/YYYY_MM_DD.pickle
Time: ~5-6 seconds
Skip if: Using crafted pickle (Step 1 would overwrite it!)
Step 2: Label and Convert
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/app/label_parsed_excel_file_new.py --day YYYY_MM_DD
Input: /db23/parsed_excel_files/YYYY_MM_DD.pickle
Output: /db23/df_new_day_numerical/YYYY_MM_DD.pickle, /db23/df_new_day_str/YYYY_MM_DD.pickle
Time: ~45-50 seconds
NaN check output (expected):
day 0 nan count
stock 0 nan count
stk 0 nan count
name 0 nan count
id 0 nan count
address 0 nan count
group 179134 nan count <- EXPECTED (filled later)
Step 3: Push to df_trans
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/gen_data/push_day_to_df_trans.py --day YYYY_MM_DD
Input: /db23/df_new_day_numerical/YYYY_MM_DD.pickle
Output: Updated /db23/pickles/df_trans_numerical.pickle
Time: ~15-20 seconds
Success indicator: SUCCEEDED! (/db23/pickles/df_trans_numerical.pickle)
Step 4: Upload to Remote
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/app/push_s2i_and_latest_day_to_remote.py --day YYYY_MM_DD
Uploads to: m3u (sotola@m3u:22222) Time: ~30-60 seconds (network dependent)
Step 5: Remote Compute + Verification
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/remote_compute_df_trans.py --day YYYY_MM_DD
Remote script: merge_to_df_trans.py (replaces old known_good_version.py)
Time: ~40-70 seconds (+ ~10s for verification)
CRITICAL: The --day flag is REQUIRED when re-processing existing days. Without it, the remote script auto-detects the "latest" day by filename, which may not be the day you're processing.
- New day (not in remote df_trans):
--dayoptional - Existing day (re-processing):
--dayREQUIRED
Integrated Verification: Step 5 now automatically verifies after merge:
- Local row count - reads
/db23/pickles/df_trans_numerical.pickle - Remote row count - reads remote df_trans via SSH
- NaN check - ensures no NaN in
stkcolumn (both local & remote) - Match check - local and remote row counts must be equal
Verification Output:
=== VERIFICATION: 2025_12_17 ===
=== LOCAL VERIFICATION ===
Local row count for 2025_12_17: 50,149
✓ Local stk column: No NaN values
=== REMOTE VERIFICATION ===
Remote row count for 2025_12_17: 50,149
✓ Remote stk column: No NaN values
=== VERIFICATION SUMMARY ===
✓ Local: 50,149 rows, no NaN
✓ Remote: 50,149 rows, no NaN
✓ Row counts MATCH: 50,149
=== ALL VERIFICATIONS PASSED ===
If verification fails:
- Local NaN in stk: Steps 1-4 were compromised - re-run from Step 1
- Remote NaN in stk: Remote merge issue - re-run Step 5
- Row count mismatch: Remote didn't process the uploaded day - check
--dayflag
Step 6: Sync to Kame
PYTHONHASHSEED=0 PYTHONPATH=/Users/sotola/PycharmProjects/mac_local_m4:$PATH \
python /Users/sotola/PycharmProjects/mac_local_m4/automated_pipelines/get_latest_day_from_m3u_then_push_to_kame.py
Syncs to: kame (kame@kamegoku:22)
Expected Timeouts
| Step | Normal | Recommended Timeout |
|---|---|---|
| Step 1 | 6s | 10s |
| Step 2 | 50s | 70s |
| Step 3 | 20s | 30s |
| Step 4 | 60s | 90s |
| Step 5 | 70s | 100s |
| Step 6 | 30s | 60s |
| Full pipeline | ~4 min | ~6 min |
Data Locations
| Path | Contents |
|---|---|
/home/ubuntu/Downloads/ | Input Excel files (5 DDMM.xlsx) |
/db23/parsed_excel_files/ | Per-day parsed pickles |
/db23/df_new_day_numerical/ | Daily numerical dataframes |
/db23/df_new_day_str/ | Daily string dataframes |
/db23/pickles/ | Master pickles (df_trans, s2i/i2s) |
Common Mistakes
| Mistake | Consequence | Prevention |
|---|---|---|
| Running Step 1 on crafted day | Overwrites crafted pickle with empty/wrong data | Skip Step 1 for crafted days |
| Missing PYTHONHASHSEED=0 | KeyError with large negative number | Always include in every command |
| Running steps in parallel | Data corruption | Run sequentially |
| Processing multiple days at once | Pickle conflicts | One day at a time |
| Step 5 without --day for existing day | Remote processes wrong day (latest by filename) | Always pass --day for historical days |
Communication Protocol
# On success
say "Day YYYY MM DD pipeline complete"
# On failure
say "Pipeline error on YYYY MM DD"
Remote Servers
| Server | SSH | Purpose |
|---|---|---|
| m3u | sotola@m3u:22222 | Primary compute |
| kame | kame@kamegoku:22 | Backup/sync |
Script Locations
| Script | Local | Remote |
|---|---|---|
merge_to_df_trans.py | /Users/sotola/PycharmProjects/remote_to_m3_ultra/workflows/ | Same path |
known_good_version.py | Same directory | Deprecated - ignored --day flag |
Incident Handling
This section documents resolved incidents and their solutions. When encountering similar issues, check here first.
Incident 2026_01_06: Complementary Data Overwrite (Day 2025_12_16)
Problem:
- Old data for 2025_12_16 had 102,122 rows of regular orders only (is_pt=0)
- New Excel file for 2025_12_16 had 51 rows of PT transactions only (Sheet1 empty, Sheet2 had 51 PT)
- Running the pipeline overwrote old data with new, losing 102,122 regular orders
Symptoms:
- After pipeline: Dec 16 had only 102 rows (51 PT × 2 for buyer+seller)
- Before pipeline: Dec 16 had 102,122 rows (all non-PT)
- Data was complementary, not replacement
Root Cause:
- Multi-day Excel file was split, but the split only contained PT data for Dec 16
- Step 1 overwrote the existing parsed pickle with PT-only data
- No backup was made before running
Resolution:
- Restore old Step 1 output from backup:
/tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle - Concatenate old (102,122 non-PT) + new (102 PT) with matching columns:
import pandas as pd df_old = pd.read_pickle('/tmp/agent-67dba-pickle-backup/parsed_excel_files/2025_12_16.pickle') df_new = pd.read_pickle('/db23/parsed_excel_files/2025_12_16.pickle') # Reorder columns to match, cast price dtype df_new = df_new[df_old.columns] df_new['price'] = df_new['price'].astype(df_old['price'].dtype) df_combined = pd.concat([df_old, df_new], ignore_index=True) df_combined.to_pickle('/db23/parsed_excel_files/2025_12_16.pickle') - Re-run Steps 2-6 for the day
- Verify: 102,224 rows (102,122 non-PT + 102 PT)
Prevention:
- Always backup before running pipeline on existing days
- Check new data source for completeness before processing
- Compare row counts: if new << old, investigate before proceeding
Key Lesson: Old and new data can be complementary (each has data the other doesn't), not just replacement (new supersedes old). When in doubt, backup and compare before overwriting.