Text classification of non-equal length texts, should I pad left or right?

*

Dee

  • Nomad
  • ***
  • 94
  • AI rocks!
Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)


Should I pad the texts on the right:
thanks  for      sharing  --
this    article  is       great
here's  URL      --       --


Or pad on the left:
--    thanks   for     sharing
this  article  is      great
--    --       here's  URL

What are the pros and cons of padding on the left versus padding on the right?
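
To make the two layouts concrete, here is a minimal Python sketch. The pad token `--` and the helpers `pad_right` and `pad_left` are hypothetical names chosen for illustration, not functions from any library.

```python
PAD = "--"  # hypothetical padding token, matching the layouts above

def pad_right(tokens, length, pad=PAD):
    """Append pad tokens after the text until it has `length` tokens."""
    return tokens + [pad] * (length - len(tokens))

def pad_left(tokens, length, pad=PAD):
    """Prepend pad tokens before the text until it has `length` tokens."""
    return [pad] * (length - len(tokens)) + tokens

texts = [
    ["thanks", "for", "sharing"],
    ["this", "article", "is", "great"],
    ["here's", "URL"],
]
max_len = max(len(t) for t in texts)  # longest text in the batch

right_padded = [pad_right(t, max_len) for t in texts]
left_padded = [pad_left(t, max_len) for t in texts]

for row in right_padded:
    print(row)
for row in left_padded:
    print(row)
```

With right padding the real tokens always start at position 0; with left padding they always end at the last position. Which of these suits a given model better is an empirical question. For example, a model that reads left to right and keeps only its final state sees the pad tokens last under right padding, which is one reason left padding is sometimes tried with recurrent models.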

*

infurl

  • Administrator
  • ***********
  • Eve
  • *
  • 1365
  • Humans will disappoint you.
    • Home Page
I don't know why you would need to pad the text at all. What sort of algorithm are you using?

edit: OK, I googled it. You can pad from either direction; there is no fixed rule. You have to try it both ways and see which one gets the better result. It seems as arbitrary and brittle as everything else in machine learning. :D

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

edit edit: the padding is to facilitate matrix multiplication of course
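
A rough Python sketch of why a rectangular batch matters, under stated assumptions: the token ids, the pad id `0`, the `weights` vector, and the helpers `pad_right` and `matvec` are all made up for illustration, and the resulting numbers mean nothing by themselves (a real model would map ids to embeddings first).

```python
def pad_right(ids, length, pad_id=0):
    """Append pad_id until the row has `length` entries."""
    return ids + [pad_id] * (length - len(ids))

def matvec(m, v):
    """Multiply a matrix by a vector; every row must have len(v) entries."""
    assert all(len(row) == len(v) for row in m), "rows must all have the same length"
    return [sum(x * w for x, w in zip(row, v)) for row in m]

# Hypothetical token ids for the three example comments (0 is reserved for padding).
batch = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
max_len = max(len(row) for row in batch)

# After padding, the batch is a proper 3 x 4 matrix.
matrix = [pad_right(row, max_len) for row in batch]

# Hypothetical weights, one per position.
weights = [0.5, -1.0, 2.0, 0.25]

scores = matvec(matrix, weights)
print(scores)
```

Calling `matvec(batch, weights)` on the unpadded batch fails the row-length check, which is the ragged-input problem that padding works around.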

*

8pla.net

  • Trusty Member
  • ***********
  • Eve
  • *
  • 1302
  • TV News. Pub. UAL (PhD). Robitron Mod. LPC Judge.
    • 8pla.net
I was going to say, pad to the right, but infurl made me reconsider.
My Very Enormous Monster Just Stopped Using Nine

*

8pla.net

Perhaps consider using regular expressions instead.
This splits the string into tokens and checks whether
any token breaks the blog's no-links policy.

NOTE: This code is for discussion purposes only:
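<?php
// Split the comment into tokens on whitespace and commas, dropping empty tokens.
$tokens = preg_split('/[\s,]+/', "Thanks for sharing, this article is great, here's http://www.spammer.com", -1, PREG_SPLIT_NO_EMPTY);

foreach ($tokens as $token) {
    // Flag any token that looks like a link: "http:", "https:", "www." or ".com".
    if (preg_match('/https?:|www\.|\.com/i', $token)) {
        echo "Spam: $token\n";
    }
}
?>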

Program Output:

Spam: http://www.spammer.com

*

Dee

8pla.net wrote: "Perhaps consider using regular expressions instead. This splits the string into tokens and checks whether any token breaks the blog's no-links policy."


Uh, yeah, I know the traditional approach to filtering spam. I'm just trying to find out whether AI can do it too, as a personal experiment in ML.

*

8pla.net

Yes, yes, yes... You have a point, Dat D!
And I think machine learning can do it,
more easily than other machine learning
tasks, because of the underlying protocol.
Tokens like "http" and "https" should
be reliable features for ML, I would think.
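
As a minimal sketch of that idea in Python, the snippet below turns "contains a link-like token" into a 0/1 feature that a classifier could take as input. `LINK_PATTERN` and `link_feature` are hypothetical names, and the pattern itself is only an assumption about what counts as a link.

```python
import re

# Assumed definition of "link-like": starts an http/https URL or contains "www.".
LINK_PATTERN = re.compile(r"https?:|www\.", re.IGNORECASE)

def link_feature(text):
    """Return 1 if any whitespace-separated token looks like a link, else 0."""
    return int(any(LINK_PATTERN.search(tok) for tok in text.split()))

comments = [
    "thanks for sharing",
    "this article is great",
    "here's http://www.spammer.com",
]
features = [link_feature(c) for c in comments]
print(features)
```

A feature like this has a fixed size regardless of comment length, so it needs no padding at all; it could be used alone or alongside the padded token sequence.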