git+https://github.com/huggingface/transformers.git
datasets
sentencepiece
PyPDF2
pdfminer.six
pdfplumber
poppler-utils
tesseract-ocr
libtesseract-dev