llm.c rough notes
I - Getting the defaults working
- Fix the broken linux-tools install with `sudo dpkg -i --force-overwrite /var/cache/apt/archives/linux-tools-common_*.deb` and then `sudo apt --fix-broken install`.
- Get nvcc (the CUDA compiler driver) with `sudo apt install nvidia-cuda-toolkit`.
- The initial command `./train_gpt2fp32cu` doesn't work due to an OOM issue.
- Decreasing batch size and sequence length alone doesn't help either: `./train_gpt2fp32cu -b 1 -t 512`
- Check GPU usage with `nvidia-smi` and `sudo fuser -v /dev/nvidia*`, then kill the process using significant GPU resources with `kill -9 <pid>`. In my case, it was ollama.
- Finally get it running with `./train_gpt2fp32cu -b 1 -t 512`
II - Fine-tuning with a dataset (likes and bookmarks from my personal Twitter)
- Use the Twitter Web Exporter Tampermonkey script to export likes and bookmarks from Twitter into CSV files.
- Make a folder called `tweets` in `dev/data` and move the CSV files to that folder.
- Get only the tweet text and turn it into one big formatted text file (a standalone Python sketch of this step follows the commands below):
python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Bookmarks-1724168766206.csv'))))" > bookmark.txt
python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Likes-1724178823579.csv'))))" > like.txt
awk '{print $0 "\n"}' like.txt bookmark.txt > tweets.txt
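For reference, the same extraction as a single script; a minimal sketch, assuming the tweet text lives in column index 2 of each export CSV (as in the one-liners above) and using the same file names. Unlike the line-by-line awk step, it keeps multi-line tweets intact before adding the blank-line separator.

```python
# extract_tweets.py -- sketch; pulls column index 2 (the tweet text) from each
# Twitter Web Exporter CSV and writes one tweet per block, separated by a blank line.
import csv

CSV_FILES = [
    "twitter-Bookmarks-1724168766206.csv",
    "twitter-Likes-1724178823579.csv",
]

with open("tweets.txt", "w", encoding="utf-8") as out:
    for path in CSV_FILES:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.reader(f):
                if len(row) > 2 and row[2].strip():
                    out.write(row[2].strip() + "\n\n")
```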
- Refactor `tinyshakespeare.py` into a new file `tweets.py` that works on the tweets (a rough sketch follows the log output below):
writing 32,768 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_val.bin (66,560 bytes) in the gpt-2 format
writing 247,119 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_train.bin (495,262 bytes) in the gpt-2 format
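A rough sketch of what the `tweets.py` refactor looks like in spirit, assuming `tweets.txt` sits in `dev/data/tweets/`, that the `write_datafile` helper from `dev/data/data_common.py` (the one `tinyshakespeare.py` uses) is available with this signature, and a 32,768-token validation split as in the log above. Details may differ from the actual refactor.

```python
# dev/data/tweets.py -- rough sketch adapted from tinyshakespeare.py, not the
# actual refactor. Assumes data_common.write_datafile(filename, tokens) exists;
# its signature may differ between llm.c versions.
import os
import tiktoken
from data_common import write_datafile

DATA_CACHE_DIR = os.path.join(os.path.dirname(__file__), "tweets")

def tokenize():
    enc = tiktoken.get_encoding("gpt2")
    eot = enc._special_tokens["<|endoftext|>"]  # end-of-text delimiter token
    with open(os.path.join(DATA_CACHE_DIR, "tweets.txt"), "r", encoding="utf-8") as f:
        text = f.read()
    # treat each blank-line-separated tweet as its own document, prefixed by <|endoftext|>
    tokens = []
    for tweet in text.split("\n\n"):
        tweet = tweet.strip()
        if tweet:
            tokens.append(eot)
            tokens.extend(enc.encode_ordinary(tweet))
    # first 32,768 tokens become the validation split (matching the log above), rest is train
    val_tokens, train_tokens = tokens[:32768], tokens[32768:]
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_val.bin"), val_tokens)
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_train.bin"), train_tokens)

if __name__ == "__main__":
    tokenize()
```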
- Run `train_gpt2fp32cu` with the appropriate flags, i.e.
./train_gpt2fp32cu -b 1 -t 512 -i dev/data/tweets/tweets_train.bin -j dev/data/tweets/tweets_val.bin
- Modify `dev/eval/export_hf.py` so that `attn_implementation="eager"` is used in the `spin` function. Also change the test prompt in the same function to see the new behaviour. Then run
python dev/eval/export_hf.py -i gpt2_124M.bin -o gpt2_tweets
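For context, `spin` appears to load the exported checkpoint with Hugging Face transformers and sample from a test prompt, so the change amounts to roughly the following. This is only a sketch of that idea, not the actual function: the output directory, prompt, and sampling settings here are placeholders of my own.

```python
# Sketch only -- not the real spin() from export_hf.py. Loads the exported
# HF model with eager attention and samples from a custom test prompt.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

output_dir = "gpt2_tweets"  # the -o directory passed to export_hf.py
model = GPT2LMHeadModel.from_pretrained(output_dir, attn_implementation="eager")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # stock GPT-2 tokenizer

prompt = "The best thing I read today was"  # placeholder test prompt
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,  # avoid the missing-pad-token warning
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```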