Text classification of non-equal length texts, should I pad left or right?

Dee · « **on:** July 03, 2020, 07:55:33 am »

Text classification of equal-length texts works without padding, but in practice texts almost never have the same length.

For example, spam filtering on a blog article:
thanks for sharing [3 tokens] --> 0 (Not spam)
this article is great [4 tokens] --> 0 (Not spam)
here's <URL> [2 tokens] --> 1 (Spam)

Should I pad the texts on the right:
thanks for sharing --
this article is great
here's URL -- --

Or, pad on the left:
-- thanks for sharing
this article is great
-- -- here's URL

What are the pros and cons of padding left versus padding right?

infurl · « **Reply #1 on:** July 03, 2020, 08:10:51 am »

I don't know why you would need to pad the text at all. What sort of algorithm are you using?

edit: OK, I googled it. You can pad from either direction; neither is wrong in principle. You have to try it both ways and see which one gets the better result. It seems as arbitrary and brittle as everything else in machine learning.

https://machinelearningmastery.com/data-preparation-variable-length-input-sequences-sequence-prediction/

edit edit: the padding is to facilitate matrix multiplication of course
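For what it's worth, here is a minimal sketch of the two options in plain Python, using the three example texts from the question. The names `PAD` and `pad_batch` are made up for illustration and do not come from any library. The point is only that every row ends up the same length, so the batch can be stored as a rectangular matrix.

```python
# Hypothetical helper for illustration only; not part of any library.
PAD = "--"  # same pad marker as in the question

def pad_batch(token_lists, side="right", pad=PAD):
    """Pad every token list to the length of the longest one."""
    width = max(len(tokens) for tokens in token_lists)
    padded = []
    for tokens in token_lists:
        filler = [pad] * (width - len(tokens))
        # Right padding appends the filler; left padding prepends it.
        padded.append(tokens + filler if side == "right" else filler + tokens)
    return padded

texts = ["thanks for sharing", "this article is great", "here's URL"]
tokens = [t.split() for t in texts]

right = pad_batch(tokens, side="right")
left = pad_batch(tokens, side="left")

for row in right:
    print(" ".join(row))
for row in left:
    print(" ".join(row))
```

Which side works better tends to depend on the model. For instance, a recurrent network that classifies from its last state reads the padding last when you pad right, unless the padding is masked. That is one reason trying both is reasonable.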

8pla.net · « **Reply #2 on:** July 03, 2020, 12:23:47 pm »

I was going to say, pad to the right, but infurl made me reconsider.

8pla.net · « **Reply #3 on:** July 06, 2020, 06:21:17 am »

Perhaps consider using regular expressions instead.
This splits the string into tokens and checks to see
if any token breaks the blog's no-links policy.

NOTE: This code is for discussion purposes only:

Code

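<?php
// Split the comment on whitespace and commas to get tokens.
$text = preg_split("/[\s,]+/", "Thanks for sharing, this article is great, here's http://www.spammer.com");

foreach ($text as $w) {
    // Flag any token that looks like a link: an http/https scheme, "www.", or ".com".
    if (preg_match("/https?:|www\.|\.com/i", $w)) {
        echo "Spam: $w\n";
    }
}
?>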

Program Output:

Spam: http://www.spammer.com

Dee · « **Reply #4 on:** July 06, 2020, 07:23:00 am »

Quote from: 8pla.net on July 06, 2020, 06:21:17 am

Perhaps consider using regular expressions instead.
This splits the string into tokens and checks to see
if any token breaks the blog's no-links policy.

Uh, yeah, I know the traditional approach to filtering spam. I'm just trying to find out whether AI can do it too, for personal experience with ML.

8pla.net · « **Reply #5 on:** July 09, 2020, 12:07:27 am »

Yes, yes, yes... You have a point, Dat D!
And I think machine learning can do it,
more easily than other machine learning
tasks, because of the underlying protocol.
Tokens like "http" and "https" should
be reliable data for ML, I would think.
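As a rough sketch of that idea (an assumption on my part, not a tested spam model), a link-like token could be turned into a 0/1 feature that a classifier consumes alongside the padded tokens. `LINK_PATTERN` and `link_feature` are hypothetical names, and the pattern just mirrors the regex idea from the PHP snippet earlier in the thread.

```python
import re

# Assumed pattern for illustration: http/https scheme, "www.", or ".com".
LINK_PATTERN = re.compile(r"https?:|www\.|\.com", re.IGNORECASE)

def link_feature(text):
    """Return 1 if any token in the text looks like a link, else 0."""
    tokens = re.split(r"[\s,]+", text.strip())
    return int(any(LINK_PATTERN.search(tok) for tok in tokens))

comments = [
    "thanks for sharing",
    "this article is great",
    "here's http://www.spammer.com",
]
features = [link_feature(c) for c in comments]
print(features)
```

A single hand-made feature like this is really the regex approach in disguise. The ML version of the experiment would let the model learn from the tokens which ones matter.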
