Ai Dreams Forum

Member's Experiments & Projects => AI Programming => Topic started by: Dee on July 03, 2020, 07:55:33 am

Title: Text classification of non-equal length texts, should I pad left or right?
Post by: Dee on July 03, 2020, 07:55:33 am
Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering of comments on a blog article:

thanks for sharing    [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL>          [2 tokens] --> 1 (Spam)


Should I pad the texts on the right:
thanks for     sharing --
this   article is      great
here's URL     --      --


Or, pad on the left:
--   thanks  for    sharing
this article is     great
--   --      here's URL

What are the pros and cons of padding on the left versus padding on the right?
Title: Re: Text classification of non-equal length texts, should I pad left or right?
Post by: infurl on July 03, 2020, 08:10:51 am
I don't know why you would need to pad the text at all. What sort of algorithm are you using?

edit: OK, I googled it. You can pad from either direction; there's no fixed rule. You have to try it both ways and see which one gets the better result. It seems as arbitrary and brittle as everything else in machine learning. :D

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

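edit edit: the padding is there to facilitate matrix multiplication, of course. The sequences in a batch have to be the same length before they can be stacked into one matrix.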
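
To make the two options concrete, here is a minimal sketch in plain Python that pads the three example comments from the first post both ways. The `pad` helper, the `vocab` ids, and the choice of 0 as the padding id are all assumptions made up for illustration. They are not taken from any particular library.

```python
# Assumption: 0 is reserved as the padding id; real tokens start at 1.
PAD = 0

def pad(seqs, side="right", pad_id=PAD):
    """Pad every sequence to the length of the longest one in the batch."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(s) for s in seqs)
    out = []
    for s in seqs:
        fill = [pad_id] * (width - len(s))
        out.append(s + fill if side == "right" else fill + s)
    return out

# Hypothetical vocabulary covering the three example comments.
vocab = {"thanks": 1, "for": 2, "sharing": 3, "this": 4, "article": 5,
         "is": 6, "great": 7, "here's": 8, "<URL>": 9}
texts = ["thanks for sharing", "this article is great", "here's <URL>"]
batch = [[vocab[t] for t in text.split()] for text in texts]

right_padded = pad(batch, side="right")
left_padded = pad(batch, side="left")
print(right_padded)  # [[1, 2, 3, 0], [4, 5, 6, 7], [8, 9, 0, 0]]
print(left_padded)   # [[0, 1, 2, 3], [4, 5, 6, 7], [0, 0, 8, 9]]
```

Either result is a rectangular batch. With right padding the first real token always sits in column 0. With left padding the last real token always sits in the final column, which may matter for a model that reads left to right and classifies from its final state. Which one works better still has to be checked on the actual data.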
Title: Re: Text classification of non-equal length texts, should I pad left or right?
Post by: 8pla.net on July 03, 2020, 12:23:47 pm
I was going to say, pad to the right, but infurl made me reconsider.
Title: Re: Text classification of non-equal length texts, should I pad left or right?
Post by: 8pla.net on July 06, 2020, 06:21:17 am
Perhaps consider using regular expressions instead.
This splits the string into tokens and checks to see
if any token breaks the blog's no-links policy.

NOTE: This code is for discussion purposes only:
Code
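<?php
// Split the comment into tokens on whitespace and commas.
$text = preg_split('/[\s,]+/', "Thanks for sharing, this article is great, here's http://www.spammer.com");

// Flag any token that looks like a link (http:, https:, www. or .com).
foreach ($text as $w) {
    if (preg_match('/https?:|www\.|\.com/i', $w)) {
        echo "Spam: $w\n";
    }
}
?>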

Program Output:

Spam: http://www.spammer.com
Title: Re: Text classification of non-equal length texts, should I pad left or right?
Post by: Dee on July 06, 2020, 07:23:00 am
Quote from: 8pla.net on July 06, 2020, 06:21:17 am
Perhaps consider using regular expressions instead.
This splits the string into tokens and checks to see
if any token breaks the blog's no-links policy.


Uh, yeah, I know the traditional approach to filtering spam. I'm just trying to find out whether AI can do it too, as a personal learning experience in ML.
Title: Re: Text classification of non-equal length texts, should I pad left or right?
Post by: 8pla.net on July 09, 2020, 12:07:27 am
Yes, yes, yes... You have a point, Dat D!
And machine learning, I think, can do it,
more easily than other machine learning
tasks, due to the underlying protocol.
Tokens like "http" and "https" should
be reliable data for ML, I would think.
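
As a rough sketch of how that signal could be handed to a model, the Python below maps link-like tokens to a single `<URL>` token (as in the first post's example) and derives a binary feature from it. The `LINK` pattern is modelled on the regex in the PHP snippet earlier in the thread. The `normalize` and `has_url` helpers are hypothetical names, and this only prepares the input; it is not a classifier.

```python
import re

# Assumed pattern: link-like tokens contain http:, https:, www. or .com.
LINK = re.compile(r"https?:|www\.|\.com", re.IGNORECASE)

def normalize(text):
    """Split on whitespace/commas and replace link-like tokens with '<URL>'."""
    tokens = [t for t in re.split(r"[\s,]+", text) if t]
    return ["<URL>" if LINK.search(t) else t.lower() for t in tokens]

def has_url(tokens):
    """Binary feature: 1 if any token was mapped to '<URL>', else 0."""
    return int("<URL>" in tokens)

comments = ["thanks for sharing",
            "this article is great",
            "here's http://www.spammer.com"]
tokenized = [normalize(c) for c in comments]
features = [has_url(t) for t in tokenized]
print(tokenized[2])  # ["here's", '<URL>']
print(features)      # [0, 0, 1]
```

Collapsing every link into one token keeps the vocabulary small, so a model sees the same `<URL>` id no matter which address a spammer posts.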