---
datasets:
- zalando-datasets/fashion_mnist
language:
- en
metrics:
- accuracy
pipeline_tag: image-classification
tags:
- fashion
- clothes
- fashion_mnist
- CNN
- Classification
---

# BeitForImageClassification

## Model Structure

### BeitModel

- **Embeddings: BeitEmbeddings**
  - Uses patch embeddings computed by a `Conv2d` layer (3 input channels, 768 output channels, kernel size 16x16, stride 16x16).
  - Includes a dropout layer with probability 0.0.
- **Encoder: BeitEncoder**
  - Contains 12 `BeitLayer` modules.
  - Each `BeitLayer` includes:
    - **Attention: BeitAttention**
      - `BeitSelfAttention` with linear layers for query, key, and value, plus dropout and a relative position bias.
      - `BeitSelfOutput` with a linear layer and dropout.
    - **Intermediate: BeitIntermediate**
      - Dense layer expanding the hidden dimension from 768 to 3072, followed by a GELU activation.
    - **Output: BeitOutput**
      - Dense layer projecting the hidden dimension back from 3072 to 768, with dropout.
    - **LayerNorm** applied before the attention block and again before the intermediate (feed-forward) block.
    - **Drop Path** (stochastic depth) with a probability that varies across layers.
- **Pooler: BeitPooler**
  - Contains a layer normalization.

### Classifier: Linear

- Linear layer mapping the 768-dimensional pooled embedding to 10 output classes.

## Detected Classes

The model has been trained to classify images into the following classes:

1. T-shirt / top
2. Trouser
3. Pullover
4. Dress
5. Coat
6. Sandal
7. Shirt
8. Sneaker
9. Bag
10. Ankle boot
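
The dimensions listed under Model Structure can be sanity-checked with a little arithmetic. The sketch below is illustrative only and uses plain Python rather than the model itself. It assumes a 224x224 input resolution (this card does not state one) and assumes the class indices run from 0 to 9 in the order listed above, as in the Fashion-MNIST label convention. The helper names (`num_patches`, `conv2d_params`, `linear_params`, `predict_label`) are hypothetical and are not part of any library.

```python
# Illustrative arithmetic for the layer sizes described in this card.
# ASSUMPTION: 224x224 input images; the card does not state the resolution.
# ASSUMPTION: class index i corresponds to item i + 1 in the list above.

IMAGE_SIZE = 224          # assumed, not taken from this card
PATCH_SIZE = 16           # Conv2d kernel size and stride
IN_CHANNELS = 3
HIDDEN_SIZE = 768
INTERMEDIATE_SIZE = 3072
NUM_CLASSES = 10

ID2LABEL = {
    0: "T-shirt / top",
    1: "Trouser",
    2: "Pullover",
    3: "Dress",
    4: "Coat",
    5: "Sandal",
    6: "Shirt",
    7: "Sneaker",
    8: "Bag",
    9: "Ankle boot",
}


def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    per_side = image_size // patch_size
    return per_side * per_side


def conv2d_params(in_channels: int, out_channels: int, kernel: int) -> int:
    """Weights plus biases of a Conv2d layer with a square kernel."""
    return in_channels * out_channels * kernel * kernel + out_channels


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


def predict_label(logits: list[float]) -> str:
    """Map a list of 10 class scores to the name of the highest-scoring class."""
    best_index = max(range(len(logits)), key=logits.__getitem__)
    return ID2LABEL[best_index]


if __name__ == "__main__":
    print(num_patches(IMAGE_SIZE, PATCH_SIZE))                    # 196
    print(conv2d_params(IN_CHANNELS, HIDDEN_SIZE, PATCH_SIZE))    # 590592
    print(linear_params(HIDDEN_SIZE, INTERMEDIATE_SIZE))          # 2362368
    print(linear_params(HIDDEN_SIZE, NUM_CLASSES))                # 7690
    print(predict_label([0.0] * 9 + [1.0]))                       # Ankle boot
```

Note that Fashion-MNIST images are 28x28 grayscale, while the patch embedding expects 3 input channels, so inputs presumably need to be resized and converted to 3 channels before inference. The exact preprocessing used for this model is not documented here.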