Machine Learning

Model

monsoon-nlp/bert-base-thai is a Natural Language Processing (NLP) model distributed through the Transformers library in Python. We used it as the pre-trained base for our sentiment analysis model.

BERT-th provides a Thai-only pre-trained model based on the BERT-Base architecture.
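A minimal sketch of loading the checkpoint from the Hugging Face Hub. We assume AutoTokenizer can read the tokenizer files bundled with the checkpoint (the BERT-th repo also documents its own Thai tokenizer):

```python
# Minimal sketch: load the pre-trained checkpoint from the Hugging Face Hub.
# Assumption: AutoTokenizer reads the tokenizer files bundled with the
# checkpoint (the BERT-th repo also documents a custom Thai tokenizer).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")
model = AutoModel.from_pretrained("monsoon-nlp/bert-base-thai")

inputs = tokenizer("สวัสดีครับ", return_tensors="pt")  # a Thai greeting
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, sequence length, hidden size)
```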

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT (Bidirectional Encoder Representations from Transformers) is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.

Ref: https://arxiv.org/pdf/1810.04805.pdf
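As a hedged illustration of the "one additional output layer" idea, Transformers can attach a fresh classification head on top of the pre-trained encoder; the two sentiment labels here are our assumption:

```python
# Hedged sketch of "one additional output layer": Transformers attaches a
# randomly initialized classification head on top of the pre-trained encoder.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "monsoon-nlp/bert-base-thai",
    num_labels=2,  # assumption: binary positive/negative sentiment labels
)
# The encoder keeps its pre-trained weights; only the new head starts from
# scratch, and both are updated together during fine-tuning.
```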

Libraries we use in the model

  • Transformers - provides APIs and tools to easily download and train state-of-the-art pre-trained models

  • Datasets - a library for easily accessing and sharing datasets for Audio, Computer Vision, and Natural Language Processing (NLP) tasks

  • NumPy - a Python library for working with arrays; it also has functions for linear algebra, Fourier transforms, and matrices (see the combined usage sketch after this list)
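A short sketch of the three libraries working together. The public wisesight_sentiment dataset and its "texts" column are assumptions standing in for our own data:

```python
# Combined sketch of the three libraries. Assumption: the public
# "wisesight_sentiment" Hub dataset and its "texts" column stand in for
# our own data.
import numpy as np
from datasets import load_dataset
from transformers import AutoTokenizer

dataset = load_dataset("wisesight_sentiment", split="train")        # Datasets
tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")  # Transformers

encoded = tokenizer(dataset[0]["texts"], return_tensors="np")       # NumPy arrays
print(np.asarray(encoded["input_ids"]).shape)
```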

Steps to preprocess the data

  • Cleaning data

    • Removing HTML, punctuation, emoji

  • Tokenization

  • Words to integers

  • Train/test split (see the pipeline sketch after this list)
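A sketch of the pipeline above. The cleaning regexes, 128-token length, 80/20 split, and the wisesight_sentiment dataset are illustrative assumptions, not our exact settings:

```python
# Sketch of the preprocessing pipeline above. The regexes, 128-token length,
# and 80/20 split are illustrative assumptions, not our exact settings.
import re

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("monsoon-nlp/bert-base-thai")

def clean(text):
    text = re.sub(r"<[^>]+>", " ", text)   # remove HTML tags
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation and emoji
    return " ".join(text.split())

def encode(batch):
    # Tokenization and words-to-integers in one call: the tokenizer splits
    # each cleaned text and maps every token to its vocabulary id.
    return tokenizer([clean(t) for t in batch["texts"]],
                     truncation=True, padding="max_length", max_length=128)

dataset = load_dataset("wisesight_sentiment", split="train")  # assumed dataset
dataset = dataset.map(encode, batched=True)

splits = dataset.train_test_split(test_size=0.2)  # train and test sets
train_ds, test_ds = splits["train"], splits["test"]
```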

Source (our Colab notebook): https://colab.research.google.com/drive/1p7GkPV8z_X71NpnnLV8kvp8hh-zi4z9z?usp=sharing
