llm.c rough notes
I - Getting the defaults working
- Fix the broken linux-tools install with `sudo dpkg -i --force-overwrite /var/cache/apt/archives/linux-tools-common_*.deb` and then `sudo apt --fix-broken install`.
- Get nvcc (the CUDA compiler driver) with `sudo apt install nvidia-cuda-toolkit`.
- The initial command `./train_gpt2fp32cu` doesn't work due to an OOM issue.
- Decreasing batch size and sequence length alone doesn't help either: `./train_gpt2fp32cu -b 1 -t 512`
- Check GPU usage with `nvidia-smi` and `sudo fuser -v /dev/nvidia*`, then kill the process using significant GPU resources with `kill -9 <pid>`. In my case, it was ollama.
- Finally get it running with `./train_gpt2fp32cu -b 1 -t 512`
II - Fine-tuning with a dataset (likes and bookmarks from my personal Twitter)
- Use the Twitter Web Exporter Tampermonkey script to export likes and bookmarks from Twitter into CSV files.
- Make a folder called `tweets` in `dev/data` and move the CSV files to that folder.
- Get only the tweet text and turn it into one big formatted text file (a standalone Python sketch of this step follows the commands below):
python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Bookmarks-1724168766206.csv'))))" > bookmark.txt
python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Likes-1724178823579.csv'))))" > like.txt
awk '{print $0 "\n"}' like.txt bookmark.txt > tweets.txt
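For reference, the same extraction as a single script; a minimal sketch, assuming the tweet text lives in column index 2 of each export CSV (as in the one-liners above) and using the same file names. Unlike the line-by-line awk step, it keeps multi-line tweets intact before adding the blank-line separator.

```python
# extract_tweets.py -- sketch; pulls column index 2 (the tweet text) from each
# Twitter Web Exporter CSV and writes one tweet per block, separated by a blank line.
import csv

CSV_FILES = [
    "twitter-Bookmarks-1724168766206.csv",
    "twitter-Likes-1724178823579.csv",
]

with open("tweets.txt", "w", encoding="utf-8") as out:
    for path in CSV_FILES:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) > 2 and row[2].strip():
                    out.write(row[2].strip() + "\n\n")
```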
- Refactor `tinyshakespeare.py` into a new file `tweets.py` that works on the tweets (a rough sketch follows the log output below):
writing 32,768 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_val.bin (66,560 bytes) in the gpt-2 format
writing 247,119 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_train.bin (495,262 bytes) in the gpt-2 format
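A rough sketch of what the `tweets.py` refactor looks like in spirit, assuming `tweets.txt` sits in `dev/data/tweets/`, that the `write_datafile` helper from `dev/data/data_common.py` (the one `tinyshakespeare.py` uses) is available with this signature, and a 32,768-token validation split as in the log above. Details may differ from the actual refactor.

```python
# dev/data/tweets.py -- rough sketch adapted from tinyshakespeare.py, not the
# actual refactor. Assumes data_common.write_datafile(filename, tokens) exists;
# its signature may differ between llm.c versions.
import os
import tiktoken
from data_common import write_datafile

DATA_CACHE_DIR = os.path.join(os.path.dirname(__file__), "tweets")

def tokenize():
    enc = tiktoken.get_encoding("gpt2")
    eot = enc._special_tokens["<|endoftext|>"]  # end-of-text delimiter token
    with open(os.path.join(DATA_CACHE_DIR, "tweets.txt"), "r", encoding="utf-8") as f:
        text = f.read()
    # treat each blank-line-separated tweet as its own document, prefixed by <|endoftext|>
    tokens = []
    for tweet in text.split("\n\n"):
        tweet = tweet.strip()
        if tweet:
            tokens.append(eot)
            tokens.extend(enc.encode_ordinary(tweet))
    # first 32,768 tokens become the validation split (matching the log above), rest is train
    val_tokens, train_tokens = tokens[:32768], tokens[32768:]
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_val.bin"), val_tokens)
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_train.bin"), train_tokens)

if __name__ == "__main__":
    tokenize()
```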
- Run `train_gpt2fp32cu` with the appropriate flags, i.e.
./train_gpt2fp32cu -b 1 -t 512 -i dev/data/tweets/tweets_train.bin -j dev/data/tweets/tweets_val.bin
- Modify `dev/eval/export_hf.py` so that `attn_implementation="eager"` is used in the `spin` function. Also change the test prompt in the same function to see the new behaviour. Then run
python dev/eval/export_hf.py -i gpt2_124M.bin -o gpt2_tweets
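For context, `spin` appears to load the exported checkpoint with Hugging Face transformers and sample from a test prompt, so the change amounts to roughly the following. This is only a sketch of that idea, not the actual function: the output directory, prompt, and sampling settings here are placeholders of my own.

```python
# Sketch only -- not the real spin() from export_hf.py. Loads the exported
# HF model with eager attention and samples from a custom test prompt.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

output_dir = "gpt2_tweets"  # the -o directory passed to export_hf.py
model = GPT2LMHeadModel.from_pretrained(output_dir, attn_implementation="eager")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stock GPT-2 tokenizer

prompt = "The best thing I read today was"  # placeholder test prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```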