llm.c rough notes
I - Getting the defaults working
- Fix the broken linux-tools install with `sudo dpkg -i --force-overwrite /var/cache/apt/archives/linux-tools-common_*.deb` and then `sudo apt --fix-broken install`
- Get nvcc (the CUDA compiler driver) with `sudo apt install nvidia-cuda-toolkit`
- The initial command `./train_gpt2fp32cu` doesn't work due to an OOM issue.
- Decreasing the batch size and sequence length with `./train_gpt2fp32cu -b 1 -t 512` doesn't work either.
- See GPU usage stats with `nvidia-smi` and `sudo fuser -v /dev/nvidia*`, then kill the process hogging GPU memory with `kill -9 <pid>`. In my case, it was ollama.
- Finally get it running with `./train_gpt2fp32cu -b 1 -t 512`
II - Fine-tuning with a dataset (likes and bookmarks from my personal Twitter)
- Use the Twitter Web Exporter Tampermonkey script to export likes and bookmarks from Twitter into CSV files.
- Make a folder called `tweets` in `dev/data` and move the CSV files there
- Get only the tweet text and turn it into one big formatted text file (a Python sketch of the same extraction follows below):
  `python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Bookmarks-1724168766206.csv'))))" > bookmark.txt`
  `python -c "import csv; print('\n'.join(row[2] for row in csv.reader(open('twitter-Likes-1724178823579.csv'))))" > like.txt`
  `awk '{print $0 "\n"}' like.txt bookmark.txt > tweets.txt`
- Refactor `tinyshakespeare.py` into a new file `tweets.py` to work on the tweets (a sketch of what it might look like follows below). Output:
  writing 32,768 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_val.bin (66,560 bytes) in the gpt-2 format
  writing 247,119 tokens to /home/saahityaedams/workspace/llm.c/dev/data/tweets/tweets_train.bin (495,262 bytes) in the gpt-2 format
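A rough sketch of what `tweets.py` could look like, assuming it mirrors the structure of `dev/data/tinyshakespeare.py` (tiktoken GPT-2 encoding, `write_datafile` from `data_common.py`, and the 32,768-token validation split seen in the output above); the exact signature of the upstream helper may differ from what's shown here.

```python
import os
import numpy as np
import tiktoken
from data_common import write_datafile  # assumed helper from llm.c's dev/data/data_common.py

# Assumed layout: tweets.txt from the previous step lives in dev/data/tweets
DATA_CACHE_DIR = os.path.join(os.path.dirname(__file__), "tweets")

def tokenize():
    enc = tiktoken.get_encoding("gpt2")
    encode = lambda s: enc.encode(s, allowed_special={"<|endoftext|>"})
    text = open(os.path.join(DATA_CACHE_DIR, "tweets.txt"), "r").read()
    # treat each blank-line-separated tweet as its own document
    text = "<|endoftext|>" + text.replace("\n\n", "\n\n<|endoftext|>")
    tokens = np.array(encode(text), dtype=np.int32)
    # first 32,768 tokens as the validation split, the rest as train
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_val.bin"), tokens[:32768])
    write_datafile(os.path.join(DATA_CACHE_DIR, "tweets_train.bin"), tokens[32768:])

if __name__ == "__main__":
    tokenize()
```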
- Run train_gpt2fp32cu with the appropriate flags, i.e. `./train_gpt2fp32cu -b 1 -t 512 -i dev/data/tweets/tweets_train.bin -j dev/data/tweets/tweets_val.bin`
- Modify `dev/eval/export_hf.py` so that `attn_implementation="eager"` in the `spin` function. Also change the test prompt in the same function to see the new behaviour. Then run `python dev/eval/export_hf.py -i gpt2_124M.bin -o gpt2_tweets` (a sketch of loading the exported model follows below).
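For a quick sanity check of the export, a minimal sketch of sampling from the resulting checkpoint with Hugging Face transformers. The `gpt2_tweets` path comes from the command above, while the prompt, the tokenizer source, and the sampling settings are placeholder assumptions; `attn_implementation="eager"` just mirrors the change made in `spin`.

```python
# Minimal sketch: load the HF-format checkpoint exported above and sample from it.
# The "gpt2_tweets" path comes from the export command; prompt and generation
# parameters are illustrative placeholders.
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2_tweets", attn_implementation="eager")
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # standard GPT-2 BPE tokenizer (assumed)

prompt = "The best thing I read today was"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=64,
    do_sample=True,
    top_k=50,
    temperature=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```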