Text classification of texts of unequal length: should I pad left or right?

*

Dat D

  • Nomad
  • ***
  • 76
  • AI rocks!
Text classification works without padding when all texts have the same length, but in practice texts almost never do.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)


Should I pad the texts on the right:
thanks for     sharing --
this   article is      great
here's URL     --      --


Or, pad on the left:
--   thanks  for    sharing
this article is     great
--   --      here's URL

What are the pros and cons of padding on the left versus on the right?
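
The two layouts above can be sketched in a few lines of Python. This is only an illustration: the `pad` helper, its `side` parameter, and the `--` pad token are assumptions made for the example, not part of any particular library.

```python
PAD = "--"  # placeholder pad token, matching the layouts above (an assumption)

def pad(tokens, length, side="right", pad_token=PAD):
    """Pad a token list to `length` on the given side ("left" or "right")."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    filler = [pad_token] * max(0, length - len(tokens))
    return filler + tokens if side == "left" else tokens + filler

texts = [
    ["thanks", "for", "sharing"],
    ["this", "article", "is", "great"],
    ["here's", "URL"],
]
max_len = max(len(t) for t in texts)  # longest text decides the padded length

right_padded = [pad(t, max_len, "right") for t in texts]
left_padded = [pad(t, max_len, "left") for t in texts]

for row in right_padded:
    print(row)
for row in left_padded:
    print(row)
```

Either way every row ends up with `max_len` entries; only the position of the real tokens changes.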

*

infurl

  • Administrator
  • *********
  • Terminator
  • *
  • 952
  • Humans will disappoint you.
    • Home Page
I don't know why you would need to pad the text at all. What sort of algorithm are you using?

edit: OK, I googled it. You can pad from either direction; neither is wrong in itself. You have to try it both ways and see which one gets the better result. It seems as arbitrary and brittle as everything else in machine learning. :D

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

edit edit: the padding is to facilitate matrix multiplication of course
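
To illustrate the matrix point: once every row has the same length, a batch of token IDs is rectangular and can be multiplied by a weight vector in one step. A rough sketch in plain Python, where the vocabulary, the pad ID 0, and the weights are all made up for the example:

```python
PAD_ID = 0  # assumed ID reserved for padding

# Hypothetical vocabulary built from the three example comments.
vocab = {"thanks": 1, "for": 2, "sharing": 3, "this": 4,
         "article": 5, "is": 6, "great": 7, "here's": 8, "URL": 9}

def encode(tokens, length):
    """Map tokens to IDs and right-pad with PAD_ID up to `length`."""
    ids = [vocab[t] for t in tokens]
    return ids + [PAD_ID] * (length - len(ids))

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector; every row must match its length."""
    for row in matrix:
        if len(row) != len(vector):
            raise ValueError("ragged batch: row length %d != %d" % (len(row), len(vector)))
    return [sum(x * w for x, w in zip(row, vector)) for row in matrix]

texts = [["thanks", "for", "sharing"],
         ["this", "article", "is", "great"],
         ["here's", "URL"]]
max_len = max(len(t) for t in texts)
batch = [encode(t, max_len) for t in texts]  # 3 rows x 4 columns, rectangular

weights = [1, 1, 1, 1]  # arbitrary weights, chosen only to show the shapes line up
print(matvec(batch, weights))  # prints [6, 22, 17]
```

Real models usually look up an embedding per ID rather than multiplying the IDs themselves; the sketch only shows why the rows need equal length. Passing the unpadded rows to `matvec` raises the `ValueError`.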

*

8pla.net

  • Trusty Member
  • **********
  • Millennium Man
  • *
  • 1227
  • TV News. Pub. UAL (PhD). Robitron Mod. LPC Judge.
    • 8pla.net
I was going to say, pad to the right, but infurl made me reconsider.
My Very Enormous Monster Just Stopped Using Nine

*

8pla.net

Perhaps consider using regular expressions instead.
The code below splits the string into tokens and checks
whether any token breaks the blog's no-links policy.

NOTE: This code is for discussion purposes only:
Code
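<?php
// Split the comment into tokens on whitespace and commas.
$tokens = preg_split("/[\s,]+/", "Thanks for sharing, this article is great, here's http://www.spammer.com");

foreach ($tokens as $token) {
    // Flag any token that looks like a link: an http(s) scheme, "www." or ".com".
    if (preg_match("/https?:|www\.|\.com/i", $token)) {
        echo "Spam: $token\n";
    }
}
?>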

Program Output:

Spam: http://www.spammer.com

*

Dat D

Uh yeah, I know the traditional approach to filtering spam. I'm just trying to find out whether AI can do it too, as a personal learning experience in ML.

*

8pla.net

Yes, yes, yes... You have a point, Dat D!
And machine learning, I think, can do it,
more easily than other machine learning
tasks, because of the underlying protocol.
Matching tokens like "http" and "https" should
give reliable features for ML, I would think.
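
As a rough sketch of that idea in Python: the link check can be turned into a feature instead of a hard rule, leaving it to a classifier to decide how much weight it gets. The function name `url_features` and the pattern are assumptions for illustration only.

```python
import re

# Assumed pattern: a token counts as link-like if it has an http(s) scheme or a "www." prefix.
LINK_RE = re.compile(r"https?:|www\.", re.IGNORECASE)

def url_features(text):
    """Return [token_count, link_token_count] for one comment."""
    tokens = [t for t in re.split(r"[\s,]+", text) if t]
    links = sum(1 for t in tokens if LINK_RE.search(t))
    return [len(tokens), links]

comments = [
    "thanks for sharing",
    "this article is great",
    "here's http://www.spammer.com",
]
for c in comments:
    print(url_features(c), c)
```

Because every comment maps to a vector of the same length, this representation needs no padding at all.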

 

