Text classification of texts of unequal length: should I pad left or right?


Dee

  • Nomad
  • ***
  • 94
  • AI rocks!
Text classification works without padding when all texts have the same length, but in practice texts almost never do.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)


Should I pad the texts on the right:
thanks for     sharing --
this   article is      great
here's URL     --      --


Or, pad on the left:
--   thanks  for    sharing
this article is     great
--   --      here's URL

What are the pros and cons of padding on the left versus padding on the right?
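
To make the two options concrete, here is a minimal Python sketch. The helper name `pad_tokens` and the `--` pad token are assumptions made for illustration; they are not taken from any library.

```python
PAD = "--"  # assumed padding token, matching the examples above

def pad_tokens(tokens, length, side="right"):
    """Pad a token list with PAD up to `length` on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    if len(tokens) > length:
        raise ValueError("text is longer than the target length")
    padding = [PAD] * (length - len(tokens))
    return tokens + padding if side == "right" else padding + tokens

texts = ["thanks for sharing", "this article is great", "here's <URL>"]
tokenized = [t.split() for t in texts]
max_len = max(len(t) for t in tokenized)  # longest comment sets the width

right_padded = [pad_tokens(t, max_len, "right") for t in tokenized]
left_padded = [pad_tokens(t, max_len, "left") for t in tokenized]

for row in right_padded:
    print("right:", row)
for row in left_padded:
    print("left: ", row)
```

Either way every row ends up with the same number of tokens; only the position of the real tokens relative to the padding differs.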

*

infurl

  • Administrator
  • ***********
  • Eve
  • *
  • 1372
  • Humans will disappoint you.
    • Home Page
I don't know why you would need to pad the text at all. What sort of algorithm are you using?

edit: OK, I googled it. You can pad from either direction; there is no fixed rule. You have to try it both ways and see which one gives the better result. It seems as arbitrary and brittle as everything else in machine learning. :D

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

edit edit: the padding is there so that the texts can be stacked into one fixed-size matrix, which facilitates matrix multiplication, of course
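
A rough sketch of that point in plain Python: rows of token IDs with different lengths do not form a rectangular matrix, but padded rows do. The toy vocabulary, the pad ID `0`, and the mask are assumptions made for illustration.

```python
PAD_ID = 0  # assumed ID reserved for padding

texts = [["thanks", "for", "sharing"],
         ["this", "article", "is", "great"],
         ["here's", "<URL>"]]

# Build a toy vocabulary; IDs start at 1 because 0 is the pad ID.
vocab = {}
for tokens in texts:
    for tok in tokens:
        vocab.setdefault(tok, len(vocab) + 1)

ids = [[vocab[tok] for tok in tokens] for tokens in texts]
max_len = max(len(row) for row in ids)

# Right-pad every row so the batch becomes a rectangular matrix.
batch = [row + [PAD_ID] * (max_len - len(row)) for row in ids]
# The mask marks real tokens (1) versus padding (0).
mask = [[1 if i != PAD_ID else 0 for i in row] for row in batch]

shape = (len(batch), len(batch[0]))
print(shape)
print(batch)
print(mask)
```

A mask like this is one common way to let a model ignore the padded positions, whichever side they are on.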

*

8pla.net

  • Trusty Member
  • ***********
  • Eve
  • *
  • 1307
  • TV News. Pub. UAL (PhD). Robitron Mod. LPC Judge.
    • 8pla.net
I was going to say pad to the right, but infurl made me reconsider.
My Very Enormous Monster Just Stopped Using Nine

*

8pla.net

Perhaps consider using regular expressions instead.
This splits the string into tokens and checks whether
any token breaks the "no links" policy on the blog.

NOTE: This code is for discussion purposes only:
Code
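<?php
// Split the comment into tokens on whitespace and commas.
$tokens = preg_split('/[\s,]+/', "Thanks for sharing, this article is great, here's http://www.spammer.com", -1, PREG_SPLIT_NO_EMPTY);

// Flag any token that looks like a link (a crude "no links" check).
foreach ($tokens as $token) {
    if (preg_match('/https?:|www\.|\.com/i', $token)) {
        echo "Spam: $token\n";
    }
}
?>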

Program Output:

Spam: http://www.spammer.com

*

Dee

Quote from 8pla.net:
  Perhaps consider using regular expressions instead.
  This splits the string into tokens and checks whether
  any token breaks the "no links" policy on the blog.


Uh, yeah, I know the traditional approach to filtering spam. I'm just trying to find out whether AI can do it too, as a personal learning experience in ML.  O0

*

8pla.net

Yes, yes, yes... You have a point, Dat D!
And machine learning, I think, can do it,
more easily than other machine learning
tasks, due to the underlying protocol.
Matching tokens like "http" and "https" should
be reliable data for ML, I would think.
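
One way to hand that signal to a model, sketched in Python: map every link-like token to a single `<URL>` placeholder, as in Dee's example, so that any link becomes the same feature. The names `normalize_tokens` and `has_url` and the pattern itself are assumptions for illustration, and the pattern is deliberately crude.

```python
import re

# Crude, assumed pattern: a token counts as a link if it starts with
# http:// or https://, or with "www.".
URL_PATTERN = re.compile(r"^(https?://|www\.)", re.IGNORECASE)

def normalize_tokens(text):
    """Split on whitespace and replace link-like tokens with <URL>."""
    return ["<URL>" if URL_PATTERN.match(tok) else tok for tok in text.split()]

def has_url(tokens):
    """Binary feature: 1 if the comment contains a link, else 0."""
    return int("<URL>" in tokens)

comments = ["thanks for sharing",
            "this article is great",
            "here's http://www.spammer.com"]
for comment in comments:
    tokens = normalize_tokens(comment)
    print(tokens, has_url(tokens))
```

On the three example comments, only the last one gets the link feature set, which matches the spam labels in the original question.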

 

