- `Categorical Cross Entropy`, which is the loss function for this neural network, makes it suitable for sentiment analysis. With this loss, the network is trained to output a probability for each class; in your case, you need probabilities for the two possible classes (`0` or `1`).
- I am not sure whether you are using a tokenizer, since it is not apparent from the code you provided, but if you are, then yes, it is a Bag of Words model. A Bag of Words model essentially stores a count for each word root that occurs in your text. From Wikipedia, if the following is your text:

John likes to watch movies. Mary likes movies too.

then, a BoW for this text would be:

{"John": 1, "likes": 2, "to": 1, "watch": 1, "movies": 2, "Mary": 1, "too": 1}
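A Bag of Words representation like the one above can be sketched in a few lines of Python (a minimal version: a real tokenizer would also handle punctuation, casing, and stemming):

```python
from collections import Counter

# Minimal Bag of Words: count how often each token occurs in the text.
text = "John likes to watch movies. Mary likes movies too."
tokens = text.replace(".", "").split()  # naive tokenization
bow = dict(Counter(tokens))
print(bow)
# {'John': 1, 'likes': 2, 'to': 1, 'watch': 1, 'movies': 2, 'Mary': 1, 'too': 1}
```

Note that word order is discarded entirely; only the counts survive, which is exactly what the "bag" in the name refers to.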

- The network architecture you are using is not convolutional; rather, it is a feedforward model, which connects every unit in one layer to every unit in the next, computing a dot product between the layer's values and a weight matrix.
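A feedforward layer of this kind reduces to a single matrix product. A minimal numpy sketch, assuming the 512-to-256 layer sizes from your model (the weights here are random placeholders, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)          # activations of the 512-unit layer
W = rng.standard_normal((512, 256))   # one weight per connection
b = np.zeros(256)                     # one bias per output unit

# Every input unit contributes to every output unit via the dot product.
h = np.maximum(0.0, x @ W + b)        # ReLU gives the next layer's 256 values
print(h.shape)  # (256,)
```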
- There is no single accepted definition of a network being *deep*. But, as a rule of thumb, if a network has more than 2 hidden layers (layers excluding the input and output layers), then it can be considered a *deep* network.
- In the code provided above, `Dense` reflects the fact that every unit in the first layer (512 units) is connected to every unit in the next layer, i.e., a total of 512×256 connections between the first and second layers.
- Yes, the connections between the 512 units in the first layer and the 256 units in the second layer result in a 512×256-dimensional matrix of parameters, which is what makes the layer *dense*. But the usage of `Dense` here is more from an API perspective than a semantic one. Similarly, the parameter matrix between the second and third layers is 256×2-dimensional.
- If you exclude the input layer (512 units) and the output layer (2 possible outputs, i.e., 0/1), then your network has one hidden layer, with 256 units.
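The parameter-matrix shapes implied by the 512 → 256 → 2 architecture can be verified with a quick numpy check (weight matrices only; a Keras `Dense` layer also adds one bias per output unit):

```python
import numpy as np

# Weight matrices of a 512 -> 256 -> 2 fully connected network.
W1 = np.zeros((512, 256))  # input layer to hidden layer
W2 = np.zeros((256, 2))    # hidden layer to output layer

print(W1.size)  # 131072 weights: the 512 x 256 connections
print(W2.size)  # 512 weights: the 256 x 2 connections
```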
- This model is supervised, since the sentiment analysis task has an output (positive or negative) associated with every input data point. You can see this output as a *supervisor* to the network, indicating whether a data point has a positive or negative sentiment. An unsupervised task does not have an output signal associated with its data points.
- The activation functions used here provide nonlinearity to the network's computations. In a little more detail, `sigmoid` has the nice property that its output can be interpreted as a probability. So if the network outputs 0.89 for a data point, the model evaluates that data point as positive with probability 0.89. The usage of sigmoid is probably for teaching purposes, since ReLU units are favored over sigmoid/tanh because of better convergence properties, and I don't see a convincing reason to use sigmoid instead of ReLU here.
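The difference between the two activations is easy to see numerically. A small sketch (the input value 2.09 is just an illustrative logit chosen so the sigmoid lands near the 0.89 mentioned above):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into (0, 1), so the output
    # can be read as a probability.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Unbounded above, so its output is NOT a probability;
    # it is typically used in hidden layers, not at the output.
    return np.maximum(0.0, z)

z = 2.09  # an example pre-activation ("logit")
print(round(float(sigmoid(z)), 2))  # 0.89 -> "positive with probability 0.89"
print(float(relu(z)))               # 2.09 -> passed through, not bounded to (0, 1)
```

This is why sigmoid (or softmax, for more than two classes) stays useful at the output layer even when ReLU is used everywhere else.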
