
Transformers

State-of-the-art Machine Learning for JAX, PyTorch and TensorFlow

pip3 install transformers

NLP Course

pipeline

```python
from transformers import pipeline

classifier = pipeline("sentiment-analysis")
classifier("I've been waiting for a HuggingFace course my whole life.")
```
```shell
[{'label': 'POSITIVE', 'score': 0.9598047137260437}]
```

By default, this pipeline selects a particular pretrained model that has been fine-tuned for sentiment analysis in English. The model is downloaded and cached when you create the classifier object; if you rerun the command, the cached model is used and there is no need to download it again.

The model cache folder is ~/.cache/huggingface by default. You can customize the cache folder by setting the HF_HOME environment variable.

The dataset cache folder is ~/.cache/huggingface/datasets by default. You can customize it by setting the HF_HOME environment variable as well.
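
A minimal sketch of relocating the cache from Python by setting HF_HOME before the libraries are imported (the path below is hypothetical):

```python
import os

# Hypothetical cache location; HF_HOME must be set before transformers/datasets are imported.
os.environ["HF_HOME"] = "/data/hf-cache"

from transformers import pipeline  # downloads from here on are cached under /data/hf-cache
```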

There are three main steps involved when you pass some text to a pipeline:

  1. The text is preprocessed into a format the model can understand.
  2. The preprocessed inputs are passed to the model.
  3. The predictions of the model are post-processed, so you can make sense of them.
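
These three steps can also be run by hand. A minimal sketch, assuming the checkpoint below is the one the sentiment-analysis pipeline uses by default:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# 1. Preprocess: raw text -> input IDs + attention mask
inputs = tokenizer(
    "I've been waiting for a HuggingFace course my whole life.",
    return_tensors="pt",
)

# 2. Run the model: inputs -> raw logits
with torch.no_grad():
    outputs = model(**inputs)

# 3. Post-process: logits -> probabilities -> label
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
label_id = int(probs.argmax(dim=-1))
print(model.config.id2label[label_id], float(probs[0, label_id]))
```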

Some of the currently available pipelines are:

  • feature-extraction (get the vector representation of a text)
  • fill-mask
  • ner (named entity recognition)
  • question-answering
  • sentiment-analysis
  • summarization
  • text-generation
  • translation
  • zero-shot-classification
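
Each of these is used the same way as sentiment-analysis. A quick sketch of the zero-shot-classification pipeline, with candidate labels made up for illustration:

```python
from transformers import pipeline

classifier = pipeline("zero-shot-classification")
classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],  # hypothetical labels
)
```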

Tokenizer

The atomic operations a tokenizer can handle: tokenization, conversion of tokens to IDs, and conversion of IDs back to strings.
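
A minimal sketch of those three operations, using bert-base-uncased purely as an example checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Using a Transformer network is simple")  # tokenization
ids = tokenizer.convert_tokens_to_ids(tokens)                         # conversion to IDs
text = tokenizer.decode(ids)                                          # IDs back to a string
print(tokens)
print(ids)
print(text)
```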

Batching is the act of sending multiple sentences through the model, all at once.

Padding makes sure all our sentences have the same length by adding a special word called the padding token to the shorter sentences.

The padding token ID can be found in tokenizer.pad_token_id.

Attention masks are tensors with exactly the same shape as the input IDs tensor, filled with 0s and 1s: 1s indicate the corresponding tokens should be attended to, and 0s indicate they should not be.
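
A short sketch of batching with padding and the resulting attention mask (again using bert-base-uncased as an example checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

batch = tokenizer(
    ["I've been waiting for a HuggingFace course my whole life.", "So have I!"],
    padding=True,          # pad the shorter sentence up to the longest one
    return_tensors="pt",
)
print(batch["input_ids"])       # padded positions contain tokenizer.pad_token_id
print(batch["attention_mask"])  # 1 = attend to this token, 0 = ignore (padding)
print(tokenizer.pad_token_id)
```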

The tokenizer added the special word [CLS] at the beginning and the special word [SEP] at the end. This is because the model was pretrained with those, so to get the same results for inference we need to add them as well. Note that some models don’t add special words, or add different ones; models may also add these special words only at the beginning, or only at the end. In any case, the tokenizer knows which ones are expected and will deal with this for you.
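
A quick way to see those special words is to decode the input IDs; this sketch assumes a BERT-style checkpoint:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
ids = tokenizer("I've been waiting for a HuggingFace course my whole life.")["input_ids"]
print(tokenizer.decode(ids))
# -> "[CLS] i've been waiting for a huggingface course my whole life. [SEP]"
```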

Model heads: Making sense out of numbers

Model heads take the high-dimensional vectors of hidden states as input and project them onto a different dimension. They are usually composed of one or a few linear layers:

The output of the Transformer model is sent directly to the model head to be processed.

[Figure: transformer_and_head, a Transformer network followed by its model head]
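
A sketch contrasting the base model's hidden states with the logits produced by a classification head (the checkpoint is chosen purely for illustration):

```python
import torch
from transformers import AutoTokenizer, AutoModel, AutoModelForSequenceClassification

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer("I've been waiting for a HuggingFace course my whole life.", return_tensors="pt")

base_model = AutoModel.from_pretrained(checkpoint)                           # no head
head_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)  # with classification head

with torch.no_grad():
    hidden = base_model(**inputs).last_hidden_state  # (batch, sequence length, hidden size)
    logits = head_model(**inputs).logits             # (batch, number of labels)
print(hidden.shape, logits.shape)
```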

Transformers-Tutorials

Transformers-Tutorials is a collection of demos made with the Transformers library by HuggingFace.

table-transformer

https://github.com/microsoft/table-transformer

https://huggingface.co/microsoft/table-transformer-detection

https://huggingface.co/microsoft/table-transformer-structure-recognition

Hugging Face Serverless Inference API

The Serverless Inference API offers a fast and free way to explore thousands of models for a variety of tasks. Whether you’re prototyping a new application or experimenting with ML capabilities, this API gives you instant access to high-performing models across multiple domains.
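
A minimal sketch of calling it over HTTP with the serverless endpoint (api-inference.huggingface.co); the model ID and token placeholder are illustrative:

```python
import requests

# Hypothetical model ID and token; replace with your own.
API_URL = "https://api-inference.huggingface.co/models/distilbert-base-uncased-finetuned-sst-2-english"
headers = {"Authorization": "Bearer YOUR_HF_TOKEN"}

response = requests.post(
    API_URL,
    headers=headers,
    json={"inputs": "I've been waiting for a HuggingFace course my whole life."},
)
print(response.json())
```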

PyTorch Model Format

A checkpoint's weights are stored either as model.safetensors (the safetensors format) or pytorch_model.bin (a pickled PyTorch state dict).
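
A sketch of producing either file with save_pretrained; safe_serialization defaulting to True is an assumption about recent Transformers versions:

```python
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
model.save_pretrained("bert-safetensors")                    # writes model.safetensors
model.save_pretrained("bert-bin", safe_serialization=False)  # writes pytorch_model.bin
```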

References