SentenceTransformer based on NovaSearch/stella_en_400M_v5

This is a sentence-transformers model finetuned from NovaSearch/stella_en_400M_v5. It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

Model Details

Model Description

  • Model Type: Sentence Transformer
  • Base model: NovaSearch/stella_en_400M_v5
  • Maximum Sequence Length: 512 tokens
  • Output Dimensionality: 1024 dimensions
  • Similarity Function: Cosine Similarity

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: NewModel 
  (1): Pooling({'word_embedding_dimension': 1024, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
  (2): Dense({'in_features': 1024, 'out_features': 1024, 'bias': True, 'activation_function': 'torch.nn.modules.linear.Identity'})
)
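
The three modules above can be read as a pipeline: token embeddings from the Transformer are averaged (pooling_mode_mean_tokens is True), and the pooled vector then passes through a linear Dense layer with an identity activation. The following is a minimal pure-Python sketch of that flow on toy 2-dimensional vectors. The function names mean_pool and dense, and all weights and inputs, are illustrative assumptions and are not taken from the model or the library.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padded positions (mask value 0)."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, value in enumerate(vec):
                total[i] += value
    count = max(count, 1)  # guard against an all-padding input
    return [t / count for t in total]


def dense(x, weight, bias):
    """Linear layer with identity activation: returns W @ x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


# Toy input: three token vectors, the last one is padding.
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)

# Hypothetical 2x2 weights; the real layer is 1024 -> 1024 with learned values.
toy_weight = [[0.5, 0.0], [0.0, 2.0]]
toy_bias = [1.0, -1.0]
sentence_embedding = dense(pooled, toy_weight, toy_bias)

print(pooled)
print(sentence_embedding)
```

Masking matters here because padded positions would otherwise pull the average toward the padding vector.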

Usage

Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

pip install -U sentence-transformers

Then you can load this model and run inference.

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("sentence_transformers_model_id")
# Run inference
sentences = [
    'Effect of input stimulus coding on self-supervised learning performance',
    "INTRODUCTION Temporal difference (TD) planning [6, 7] uses prediction for control. Consider an agent moving around a finite grid such as the one in figure 1 (the agent is incapable of crossing the barrier) trying to reach a goal whose position it does not know. If it can predict how far away from the goal it is at the current step, and how far away from the goal it is at the next step, after making a move, then it can decide whether or not that move was helpful or harmful. If, in addition, it can record this fact, then it can learn how to navigate to the goal. This generation of actions from predictions is closely related to the mechanism of dynamical programming. TD is used to learn the predictions in the first place. Consider the agent moving around randomly on the grid, receiving a negative reinforcement of -1 for every move it makes apart from moves which take it onto the goal. In this case, if it can estimat.e from every location it visits, how much reinforcement (discounted by how soon it arrives) it will get before it next reaches the goal, it will be predicting how far away it is, based on the random method of selecting actions. TD's mechanism of learning is to force the predictions to be consistent; the prediction from location a should be -1 more than the average of the predictions from the locations that can be reached in one step (hence the extra -1 reinforcement) from a. 464 Navigating Through Temporal Difference 465 If the agent initially selects each action with the same probability, then the estimate of future reinforcement from a will be monotonically related to how many steps a is away from the goal. This makes the predictions useful for criticising actions as above. In practice, the agent will modify its actions according to this criticism at the same time as learning the predictions based on those actions. 
Barto, Sutton and Watkins [2] develop this example, and show how the TD mech anism coupled with a punctate representation of the stimulus (referred to as'RBsw below) finds the optimal paths to the goal. 'RBsw ignores the cues shown in figure 1, and devotes one input unit to each location on the grid, which fires if and only if the agent is at that place. TD methods can however work with more general codes. Section 2 considers al ternative representations, including ones that are sensitive to the orientation of the agent as it moves through the grid, and section 3 looks at a restricted form of la. tent learning - what the agent can divine about its environment in the absence of reinforcement. Both techniques can improve the speed of learning. 2 ALTERNATE REPRESENTATIONS Stimulus representations, the means by which the agent finds out from the environ ment where it is, can be classified along two dimensions; whether they are punctate or distributed, and whether they are directionally sensitive or in register with the world. Over most of the grid, a 'sensible' distributed representation, such as a coarse-coded one, would be expected to make learning faster, as information about the value and action functions could be shared across adjacent grid points. There are points of discontinuity in the actions, as in the region above the right hand arm of the barrier, but they are few. In his PhD thesis [9], Watkins considers a rather similar problem to that in figure I, and solves it using his variant ofTD, Q-Iearning, based on a CMAC [1] coarse-coded representation of the space. Since his agent moves in a continuous bounded space, rather than being confined merely to discrete grid points, something of this sort is anyway essential. After the initial learning, Watkins arbitrarily makes the agent move ten times more slowly in a closed section of the space. This has a similar effect to the barrier in inducing a discontinuity in the action space. 
Despite the CMACS forcing the system to share information across such discontinuities, they were able to learn the task quickly. The other dimension over which representations may vary involves the extent to which they are sensitive to the direction in which the agent is facing. This is of interest if the agent must construe its location from the cues around the grid. In this case, rather than moving North, South, East or West, which are actions registered with the world, the agent should only move Ahead, Left or Right (Behind is disabled as an additional constraint), whose effects are also orientation dependent. This, together with the fact that the representation will be less compact (it having a larger input dimensionality) should make learning slower. Dynamical programming and its equivalents are notoriously subject to Bellman's curse of dimensionality, an engineering equivalent of exponential explosion in search. Table 1 shows four possible representations classified along these two dimensions. 466 Dayan Coarse ness Directionally Punctate Distributed Sensltlve R,x RA Insensltlve 'RBSW 'RCMAC Table 1: Representations. 'RBSW is the representation Barto, Sutton and Watkins used. R,x is punctate and directionally sensitive - it devotes four units to every grid point, one of which fires for each possible orientation of the agent. 'RcIAC' the equivalent of Watkins' representation, was not simulated, because its capabilities would not differ markedly from those of the mapping-based representation developed in the next section. nA is rather different from the other representations; it provides a test of a represen tation which is more directly associated with the sensory information that might be available directly from the cues. Figure 2 shows how 'RA works. Various identifiable cues, C 1 ... C c (c  7 in the figure) are scattered around the outside of the grid, and the agent has a fictitious 'retina' which rotates with it. 
This retina is divided into a number of angular buckets (8 in the figure), and each bucket has c units, the iSh one of which responds if the cue Ci is visible in that bucket. This representation is clearly directionally sensitive (if the agent is facing a different way, then so is its retina, and so no cue will be visible in the same bucket as it was before), and also distributed, since in general more than one cue will be visible from every location. Note that there is no restriction on the number of units that can fire in each bucket at any time - more than one will fire if more than one cue is visible there. Also, under the present system 'RA will in general not work if its coding is ambiguous - grid points must be distinguishable. Finally, it should be clear that 'RA is not biologically plausible. Figure 3 shows the learning curves for the three representations simulated. Each point is generated by switching off the learning temporarily after a certain number of iterations, starting the agent from everywhere in the grid, and averaging how many steps it takes in getting to the goal over and above the minimum necesary. It is apparent that n.x is substantially worse, but, surprisingly, that 'RA is actually better than 'RBSW . This implies that the added advantage of its distributed na ture more than outweighs its disadvantages of having more components and being directionally sensitive. One of the motivations behind studying alternate representations is the experimen tal findings on place cells in the hippocampi of rats (amongst other species). These are cells that fire only when the rat is at a certain location in its environment. Although their existence has led to many hypotheses about rat cognitive mapping (see [5J for a substantial discussion of place cells and mapping), it is important to note that even with a map, there remains the computational1y intensive problem of navigation addressed, in this paper, by TD. 
'RA, being closely related to the input stimuli is quite unlike a place cell code - the other representations all bear some similarities. Navigating Through Temporal Difference 467 3 GOAL-FREE LEARNING One of the problems with the TD system as described is that it is incapable oflatent learning in the absence of reinforcement or a goal. If the goal is just taken away, but the -1 reinforcements are still applied at each step, then the values assigned to each location will tend to -00. If both are removed, then although the agent will wander about its environment with random gay abandon, it will not pick up anything that could be used to speed subsequent learning. Latent learning experiments with rats in dry mazes prove fairly conclusively that rats running mazes in the absence of rewards and punishments learn almost as much as rats that are reinforced. One way to solve this problem is suggested by Sutton's DYNA architecture [7]. Briefly, this constructs a map of place x action - next place, and takes steps in the fictitious world constructed from its map in-between taking steps in the real world, as a way of ironing out the computational 'bumps' (ie inconsistencies) in the value and action functions. Instead, it is possible to avoid constructing a complete map by altering the repre sentation of the environment used for learning the prediction function and optimal actions. The section on representations concluded that coarse-coded representations are generally better than punctate ones, since information can be shared between neighbouring points. However, not all neighbouring points are amenable to this sharing, because of discontinuities in the value and action functions. If there were a way of generating a coarse coded representation (generally from a punctate one) that is sensitive to the structure of the task, rather than arbitrarily assigned by the environment, it should provide the base for faster learning still. 
In this case, neighbouring points should only be coded together if they are not separated by the barrier. The initial exploration would allow the agent to learn this much about the structure of the environment. Consider a set of units whose job is to predict the future discounted sum of firings of the raw input lines. Using 'R.Bsw during the initial stage of learning when the act.ions are still random, if the agent is at location (3,3) of the grid, say, then the discounted prediction of how often it will be in (3,4) (ie the frequency with which the single unit representing (3,4) will fire) will be high, since this location is close. However, the prediction for (7,11) will be low, because it is very unlikely to get there quickly. Consider the effect of the barrier: locations on opposite sides of it, eg (1,6) and (2,6), though close in the Euclidean (or Manhattan) metric on the grid, are far apart in the task. This means that the discounted prediction of how often the agent will be at (1,6) given that it starts at (2,6), will be proportionately lower. Overall, the prediction units should act like a coarse code, sensitive to the struc ture of the task. As required, this information about the environment is entirely independent of whether or not the agent is reinforced during its exploration. In fact, the resulting 'map' will be more accurate if it is not, as its exploration will be more random. The output of the prediction units is taken as an additional source of information for the value and action functions. Since their main aim is to create intelligently distributed representations from punc tate ones, it is only appropriate to use these prediction units for 'RBsw and 'R4X ' Figure 4 compares average learning curves for 'RBsw with and without these ex-468 Dayan tra mapping units, and with and without 6000 steps of latent learning (LL) in the absence of any reinforcement. A significant improvement is apparent. 
Figure 5 shows one set of predictions based on the 1lBsw representation! after a few un-reinforced iterations. The predictions are clearly fairly well developed and smooth - a predictable exponentially decaying hump. The only deviations from this are at the barrier and along the edges, where the effects of impermeability and immobility are apparent. Figure 6 shows the same set of predictions but after 2000 reinforced iterations, by which time the agent reaches the goal almost optimally. The predictions degenerate from being roughly radially symmetric (bar the barrier) to being highly asymmetric. Once the agent has learnt how to get to the goal from some location, the path it will follow, and so the locations it will visit from there, is largely fixed. The asymptotic values of the predictions will therefore be 0 for units not on the path, and -( for those on the path, where r is the number of steps since the agent's start point and 'Y is the discounting factor weighting immediate versus distant reinforcement. This is a severe limitation since it implies that the topological information present in the early stages of learning disappears evaporates, and with it almost all the benefits of the prediction units. 4 DISCUSSION Navigation comprises two problems; where the agent and the goals in its environ ment are, and how it can get to them. Having some form of cognitive map, as is suggested by the existence of place cells, addresses the first, but leaves open the second. For the case of one goal, the simple TD method described here is one solution. TD planning methods are clearly robust to changes in the way the input stimu lus is represented. Distributed codes, particularly ones that allow for the barrier, make learning faster. This is even true for 1lA' which is sensitive to the orientation of the agent. 
All these results require each location to have a unique representa tion - Mozer and Bachrach [4] and Chrisley [3] and references therein look at how ambiguities can be resolved using information on the sequence of states the agent traverses. Since these TD planning methods are totally general, just like dynamical program ming, they are unlikely to scale well. Some evidence for this comes from the rel atively poor performance of 1l.x , with its quadrupled input dimension. This puts the onus back either onto dividing the task into manageable chunks, or onto more sophisticated representation. A cknow ledgements I am very grateful to Jay Buckingham, Kate Jeffrey, Richard Morris, Toby Tyrell, David Willshaw, and the attendees of the PDP Workshop at Edinburgh, the Con nectionist Group at Amherst, and a spatial learning workshop at King's College Cambridge for their helpful comments. This work was funded by SERC. 1 Note that these are normalised to a maximum value of 10, for graphical convenience. Navigating Through Temporal Difference 469 References [1] Albus, JS (1975). A new approach to manipulator control: the Cerebellar Model Articulation Controller (CMAC). Transactions of the ASME: Journal of Dynamical Systems, Measurement and Control, 97, pp 220-227. [2] Barto, AG, Sutton, RS . Watkins, CJCH (1989). Learning and Sequential Decision Making. Technical Report 89-95, Computer and Information Science, University of Massachusetts, Amherst, MA. [3] Chrisley, RL (1990). Cognitive map construction and use: A parallel dis tributed approach. In DS Touretzky, J Elman, TJ Sejnowski, . GE Hinton, editors, Proceedings of the 1990 Con nectionist M odds Summer School. San Mateo, CA: Morgan Kaufmann. [4] Mozer, MC, . Bachrach, J (1990). Discovering the structure of a reactive en vironment by exploration. In D Touretzky, editor, Advances in Neurallnfor mation Processing Systems, , pp 439-446. San Mateo, CA: Morgan Kaufmann. [5] O'Keefe, J  Nadel, L (1978). 
The Hippocampus as a Cognitive Map. Oxford, England: Oxford University Press. [6] Sutton, RS (1988). Learning to predict by the methods of temporal difference. Machine Learning, 3, pp 9-44. [7] Sutton, RS (1990). Integrated architectures for learning, planning, and reacting based on approximating dynamic progranuning. In Proceedings of the Seventh International Conference on Machine Learning. San Mateo, CA: Morgan Kauf [8] Sutton, RS, . Barto, AG. To appear. Time-derivative models of Pavlovian conditioning. In M Gabriel . JW Moore, editors, Learning and Computational Neuroscience. Cambridge, MA: MIT Press. [9J Vatkins, CJCH (1989). Learning from Delayed Rewards. PhD Thesis. Univer sity of Cambridge, England. Agall arrier OriCIIlltloD 'Retina' Anplar bucket Dot rlrina 1. flrina Fig 2: The 'retina' for 1lA 470 Dayan Average extra steps to goal Learning iterations Fig 3: Different representations Fig 5: Initial predictions from (5,6) Average extra steps to goal Learning iterations Fig 4: Mapping with 'RBSW Fig 6: Predictions after 2000 iterations",
    "Introduction Hand-written digit recognition has become one of the touchstone problems in neural networks recently. Large databases of training examples such as the NIST (National Institute of Standards and Technology) Special Database 3 have become available, and real-world applications with clear practical value, such as recognizing zip codes in letters, have emerged. Diverse architectures with varying learning rules have been proposed, including feed-forward networks (Denker et al. 1989; Ie Cun et al. 1990; Martin and Pittman 1990), self-organizing maps (Allinson et al. 1994), and dedicated approaches such as the neocognitron (Fukushima and Wake 1990). The problem is difficult because handwriting varies a lot, some digits are easily confusable, and recognition must be based on small but crucial differences. For ex ample, the digits 3 and 8, 4 and 9, and 1 and 7 have several overlapping segments, and the differences are often lost in the noise. Thus, hand-written digit recogni tion can be seen as a process of identifying the distinct features and producing an internal representation where the significant differences are magnified, making the recognition easier. Laterally Interconnected Self-organizing Maps in Handwritten Digit Recognition 737 In this paper, the Laterally Interconnected Synergetically Self-Organizing Map ar chitecture (LISSOM; Sirosh and Miikkulainen 1994, 1995, 1996) was employed to form such a separable representation. The lateral inhibitory connections of the LIS SOM map decorrelate features in the input, retaining only those differences that are the most significant. Using LISSOM as a front end, the actual recognition can be performed by any standard neural network architecture, such as the perceptron. 
The experiments showed that while direct recognition of the digit bitmaps with a simple percept ron network is successful 72.3 of the time, and recognizing them using a standard self-organizing map (SOM) as the front end 84.1 of the time, the recognition rate is 88.1  based on the LISSOM network. These results suggest that LISSOM can serve as an effective front end for real-world handwritten character recognition systems. 2 The Recognition System 2.1 Overall architecture The system consists of two networks: a 20 x 20 LISSOM map performs the feature analysis and decorrelation of the input, and a single layer of 10 perceptrons the final recognition (Figure 1 (a)). The input digit is represented as a bitmap on the 32 x 32 input layer. Each LISSOM unit is fully connected to the input layer through the af ferent connections, and to the other units in the map through lateral excitatory and inhibitory connections (Figure 1 (b)). The excitatory connections are short range, connecting only to the closest neighbors of the unit, but the inhibitory connections cover the whole map . The percept ron layer consists of 10 units, corresponding to digits 0 to 9. The perceptrons are fully connected to the LISSOM map, receiv ing the full activation pattern on the map as their input. The perceptron weights are learned through the delta rule, and the LISSOM afferent and lateral weights through Hebbian learning. 2.2 LISSOM Activity Generation and Weight Adaptation The afferent and lateral weights in LISSOM are learned through Hebbian adapta tion. A bitmap image is presented to the input layer, and the initial activity of the map is calculated as the weighted sum of the input. For unit (i, j), the initial response TJij IS where eab is the activation of input unit (a, b), Ilij ,ab is the afferent weight connecting input unit ( a, b) to map unit (i, j), and (7 is a piecewise linear approximation of the sigmoid activation function. 
The activity is then settled through the lateral connections. Each new activity TJij (t) at step t depends on the afferent activation and the lateral excitation and inhibition: where Eij,kl and Iij,kl are the excitatory and inhibitory connection weights from map unit (k, l) to (i, j) and TJkl(t - 1) is the activation of unit (k, I) during the previous time step. The constants Ie and Ii control the relative strength of the lateral excitation and inhibition. After the activity has settled, the afferent and lateral weights are modified according to the Hebb rule. Afferent weights are normalized so that the length of the weight 738 Y. CHOE, J. SIROSH, R. MIIKKULAINEN Output Layer (10) tII'd Units with excitatory lateral connections to (iJ) Units with inhibitory lateral connections to (iJ) Figure 1: The system architecture. (a) The input layer is activated according to the bitmap image of digit 6. The activation propagates through the afferent connections to the LISSOM map, and settles through its lateral connections into a stable pattern. This pattern is the internal representation of the input that is then recognized by the perceptron layer. Through ,the connections from LISSOM to the perceptrons, the unit representing 6 is strongly activated, with weak activations on other units such as 3 and 8. (b) The lateral connections to unit (i, j), indicated by the dark square, are shown. The neighborhood of excitatory connections (lightly shaded) is elevated from the map for a clearer view. The units in the excitatory region also have inhibitory lateral connections (indicated by medium shading) to the center unit. 
The excitatory radius is 1 and the inhibitory radius vector remains the same; lateral weights are normalized to keep the sum of weights constant (Sirosh and Miikkulainen 1994): IllJ,mn - VLmn[llij,mn(t)  crinp1]ijmnF' (3) where Ilij,mn is the afferent weight from input unit (m, n) to map unit (i, j), and crinp is the input learning rate; Wij ,kl is the lateral weight (either excitatory Eij ,kl or inhibitory Iij ,kl) from map unit (k, I) to (i, j), and cr is the lateral learning rate (either crexc or crinh). 2.3 Percept ron Output Generation and Weight Adaptation The perceptrons at the output of the system receive the activation pattern on the LISSOM map as their input. The perceptrons are trained after the LISSOM map has been organized. The activation for the perceptron unit Om is where C is a scaling constant, 1]ij is the LISSOM map unit (i,j), and Vij,m is the connection weight between LISSOM map unit (i,j) and output layer unit m. The delta rule is used to train the perceptrons: the weight adaptation is proportional to the map activity and the difference between the output and the target: where crout is the learning rate of the percept ron weights, 1]ij is the LISSOM map unit activity, (m is the target activation for unit m. ((m  1 if the correct digit m, 0 otherwise). Laterally Interconnected Self-organizing Maps in Handwritten Digit Recognition 739 I Representation I Training Test Table 1: Final Recognition Results. The average recognition percentage and its variance over the 10 different splits are shown for the training and test sets. The differences in each set are statistically significant with p  .9999. 3 Experiments A subset of 2992 patterns from the NIST Database 3 was used as training and testing data.1 The patterns were normalized to make sure taht each example had an equal effect on the LISSOM map (Sirosh and Miikkulainen 1994). LISSOM was trained with 2000 patterns. 
Of these, 1700 were used to train the perceptron layer, and the remaining 300 were used as the validation set to determine when to stop training the perceptrons. The final recognition performance of the whole system was measured on the remaining 992 patterns, which neither LISSOM nor the perceptrons had seen during training. The experiment was repeated 10 times with different random splits of the 2992 input patterns into training, validation, and testing sets. The LISSOM map can be organized starting from initially random weights. How ever, if the input dimensionality is large, as it is in case of the 32 X 32 bitmaps, each unit on the map is activated roughly to the same degree, and it is difficult to bootstrap the self-organizing process (Sirosh and Miikkulainen 1994, 1996). The standard Self-Organizing Map algorithm can be used to preorganize the map in this case. The SOM performs preliminary feature analysis of the input, and forms a coarse topological map of the input space. This map can then be used as the starting point for the LISSOM algorithm, which modifies the topological organi zation and learns lateral connections that decorrelate and represent a more clear categorization of the input patterns. The initial self-organizing map was formed in 8 epochs over the training set, grad ually reducing the neighborhood radius from 20 to 8. The lateral connections were then added to the system, and over another 30 epochs, the afferent and lateral weights of the map were adapted according to equations 3 and 4. In the beginning, the excitation radius was set to 8 and the inhibition radius to 20. The excitation radius was gradually decreased to 1 making the activity patterns more concentrated and causing the units to become more selective to particular types of input pat terns. For comparison, the initial self-organized map was also trained for another 30 epochs, gradually decreasing the neighborhood size to 1 as well. 
The final afferent weights for the SOM and LISSOM maps are shown in figures 2 and 3. After the SOM and LISSOM maps were organized, a complete set of activation patterns on the two maps were collected. These patterns then formed the training input for the perceptron layer. Two separate versions were each trained for 500 epochs, one with SOM and the other with LISSOM patterns. A third perceptron layer was trained directly with the input bitmaps as well. Recognition performance was measured by counting how often the most highly ac tive perceptron unit was the correct one. The results were averaged over the 10 different splits. On average, the final LISSOMperceptron system correctly recog nized 88.1 of the 992 pattern test sets. This is significantly better than the 84.1 1 Downloadable at ftp:j jsequoyah.ncsl.nist.gov jpubjdatabasesj. 740 Y . CHOE, J. SIROSH, R. MIIKKULAINEN Figure 2: Final Afferent Weights of the SOM map . The digit-like patterns represent the afferent weights of each map unit projected on the input layer. For example, the lower left corner represents the afferent weights of unit (0,0). High weight values are shown in black and low in white. The pattern of weights shows the input pattern to which this unit is most sensitive (6 in this case). There are local clusters sensitive to each digit category. of the SOMperceptron system, and the 72.3 achieved by the perceptron layer alone (Table 1). These results suggest that the internal representations generated by the LISSOM map are more distinct and easier to recognize than the raw input patterns and the representations generated by the SOM map . 4 Discussion The architecture was motivated by the hypothesis that the lateral inhibitory con nections of the LISSOM map would decorrelate and force the map activity patterns to become more distinct. The recognition could then be performed by even the simplest classification architectures, such as the perceptron. 
Indeed, the LISSOM representations were easier to recognize than the SOM patterns, which lends evi dential support to the hypothesis. In additional experiments, the percept ron output layer was replaced by a two-weight-Iayer backpropagation network and a Hebbian associator net, and trained with the same patterns as the perceptrons. The recog nition results were practically the same for the perceptron, backpropagation, and Hebbian output networks, indicating that the internal representations formed by the LISSOM map are the crucially important part of the recognition system. A comparison of the learning curves reveals two interesting effects (figure 4). First, even though the perceptron net trained with the raw input patterns initially per forms well on the test set, its generalization decreases dramatically during training. This is because the net only learns to memorize the training examples, which does not help much with new noisy patterns. Good internal representations are there fore crucial for generalization. Second , even though initially the settling process of the LISSOM map forms patterns that are significantly easier to recognize than Laterally Interconnected Self-organizing Maps in Handwritten Digit Recognition 741 Figure 3: Final Afferent Weights of the LISSOM map. The squares identify the above-average inhibitory lateral connections to unit (10,4) (indicated by the thick square). Note that inhibition comes mostly from areas of similar functionality (i.e. areas sensitive to similar input), thereby decorrelating the map activity and forming a sparser representation of the input. the initial, unsettled patterns (formed through the afferent connections only), this difference becomes insignificant later during training. The afferent connections are modified according to the final, settled patterns, and gradually learn to anticipate the decorrelated internal representations that the lateral connections form. 
5 Conclusion The experiments reported in this paper show that LISSOM forms internal represen tations of the input patterns that are easier to categorize than the raw inputs and the patterns on the SOM map, and suggest that LISSOM can form a useful front end for character recognition systems, and perhaps for other pattern recognition systems as well (such as speech). The main direction of future work is to apply the approach to larger data sets, including the full NIST 3 database, to use a more powerful recognition network instead of the perceptron, and to increase the map size to obtain a richer representation of the input space. Acknowledgements This research was supported in part by National Science Foundation under grant IRI-9309273. Computer time for the simulations was provided by the Pittsburgh Supercomputing Center under grants IRI930005P and IRI940004P, and by a High Performance Computer Time Grant from the University of Texas at Austin. References Allinson, N. M., Johnson , M. J., and Moon, K. J. (1994). Digital realisation of self organising maps. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 6. San Mateo, CA: Morgan Kaufmann. 742 Y. CHOE. J. SIROSH. R. MIIKKULAINEN Comparison:Test 'SettIEiCLlSSOU' - Epochs Figure 4: Comparison of the learning curves, A perceptron network was trained to recognize four different kinds of internal representations: the settled LISSOM patterns, the LISSOM patterns before settling, the patterns on the final SOM network, and raw input bitmaps. The recognition accuracy on the test set was then measured and averaged over 10 simulations. The generalization of the raw input  perceptron system decreases rapidly as the net learns to memorize the training patterns. The difference of using settled and unsettled LISSOM patterns diminishes as the afferent weights of LISSOM learn to take into account the decorrelation performed by the lateral weights. Denker, J. S., Gardner, W. R., Graf, H. 
P., Henderson, D., Howard, R. E., Hubbard, W., Jackel, L. D., Baird, H. S., and Guyon, I. (1989). Neural network recognizer for hand-written zip code digits. In Touretzky, D . S., editor, Advances in Neural Information Processing Systems 1. San Mateo, CA: Morgan Kaufmann . Fukushima, K., and Wake, N. (1990). Alphanumeric character recognition by neocognitron. In Advanced Neural Computers, 263-270. Elsevier Science Pub lishers B.V . (North-Holland). Ie Cun, Y., Boser, B ., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, 1. D. (1990). Handwritten digit recognition with a back propagation network. In Touretzky, D. S., editor, Advances in Neural Infor mation Processing Systems 2. San Mateo, CA: Morgan Kaufmann . Martin, G. L ., and Pittman, J. A. (1990). Recognizing hand-printed letters and digits. In Touretzky, D. S., editor, Advances in Neural Information Processing Systems 2. San Mateo, CA: Morgan Kaufmann. Sirosh, J., and Miikkulainen, R. (1994). Cooperative self-organization of afferent and lateral connections in cortical maps . Biological Cybernetics, 71:66-78. Sirosh, J., and Miikkulainen, R. (1995). Ocular dominance and patterned lateral connections in a self-organizing model of the primary visual cortex. In Tesauro, G ., Touretzky, D. S., and Leen, T . K., editors, Advances in Neural Information Processing Systems 7. Cambridge, MA: MIT Press. Sirosh, J., and Miikkulainen, R. (1996). Topographic receptive fields and patterned lateral interaction in a self-organizing model of the primary visual cortex. Neu ral Computation (in press).",
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
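
The card lists cosine similarity as the model's similarity function. As a minimal sketch of what a pairwise cosine similarity matrix contains, the pure-Python example below uses toy 3-dimensional vectors in place of the 1024-dimensional embeddings. The helper name `cosine_similarity_matrix` and the toy values are illustrative assumptions, not part of the library.

```python
import math

def cosine_similarity_matrix(vectors):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    matrix = []
    for i, a in enumerate(vectors):
        row = []
        for j, b in enumerate(vectors):
            dot = sum(x * y for x, y in zip(a, b))
            row.append(dot / (norms[i] * norms[j]))
        matrix.append(row)
    return matrix

# Hypothetical toy vectors standing in for real sentence embeddings.
toy_embeddings = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
toy_similarities = cosine_similarity_matrix(toy_embeddings)
```

Each entry (i, j) compares sentence i with sentence j, so the diagonal is 1 and the matrix is symmetric.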

Evaluation

Metrics

Information Retrieval

Metric                Value
cosine_accuracy@10    0.9466
cosine_precision@10   0.0947
cosine_recall@10      0.9466
cosine_ndcg@5         0.8507
cosine_ndcg@10        0.8603
cosine_mrr@10         0.8323
cosine_map@10         0.8323
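
The reported values show precision@10 at about one tenth of accuracy@10, with MRR@10 equal to MAP@10. That pattern is what one would expect if every query has exactly one relevant document, though the card does not state this. Under that assumption, the per-query metrics reduce to the simple forms sketched below. The function names and toy rankings are hypothetical.

```python
import math

def retrieval_metrics(ranked_ids, relevant_id, k=10):
    """Top-k metrics for one query, assuming a single relevant document."""
    top_k = ranked_ids[:k]
    hit = relevant_id in top_k
    rank = top_k.index(relevant_id) + 1 if hit else None
    return {
        "accuracy": 1.0 if hit else 0.0,
        "precision": (1.0 / k) if hit else 0.0,
        "recall": 1.0 if hit else 0.0,
        # With one relevant document, average precision equals reciprocal rank.
        "mrr": (1.0 / rank) if hit else 0.0,
        # Ideal DCG is 1 when the single relevant document is at rank 1.
        "ndcg": (1.0 / math.log2(rank + 1)) if hit else 0.0,
    }

def average_metrics(per_query):
    """Average each metric over all queries."""
    keys = per_query[0].keys()
    return {key: sum(m[key] for m in per_query) / len(per_query) for key in keys}

# Hypothetical rankings: relevant document at rank 1, at rank 2, and missing.
toy_results = [
    retrieval_metrics(["d1", "d2", "d3"], "d1"),
    retrieval_metrics(["d4", "d5", "d6"], "d5"),
    retrieval_metrics(["d7", "d8", "d9"], "d0"),
]
averaged = average_metrics(toy_results)
```

In this setting accuracy@k and recall@k coincide, and precision@k is accuracy@k divided by k.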

Training Details

Training Dataset

Unnamed Dataset

  • Size: 14,255 training samples
  • Columns: anchor and positive
  • Approximate statistics based on the first 1000 samples:
               anchor                positive
    type       string                string
    details    • min: 7 tokens       • min: 13 tokens
               • mean: 13.4 tokens   • mean: 508.46 tokens
               • max: 24 tokens      • max: 512 tokens
  • Samples:
    anchor: Proposed architecture for time-based pattern recognition in speech, motion, and signatures
    positive: INTRODUCTION Recent interest in connectionist, or "neural" networks has emphasized their ability to store, retrieve and process patterns1,2. For most applications, the patterns to be processed are static in the sense that they lack temporal context. Another important class consists of those problems that require the processing of temporal patterns. In these the information to be learned or processed is not a particular pattern but a sequence of patterns. Such problems include speech processing, signature verification, motion detection, and predictive signal processin,r-8. More precisely, temporal pattern processing means that the desired output depends not only on the current input but also on those preceding or following it as well. This implies that two identical inputs at different time steps might yield different desired outputs depending on what patterns precede or follow them . There is another feature characteristic of much temporal pattern processing. Here an entire sequence of...
    anchor: Design approach for stabilizing analog VLSI neural systems
    positive: INTRODUCTION The term "lateral inhibition" first arose in neurophysiology to describe a common form of neural circuitry in which the output of each neuron in some population is used to inhibit the response of each of its neighbors. Perhaps the best understood example is the horizontal cell layer in the vertebrate retina, in which lateral inhibition simultaneously enhances intensity edges and acts as an automatic lain control to extend the dynamic range of the retina as a whole. The principle has been used in the design of artificial neural system algorithms by Kohonen 2 and others and in the electronic design of neural chips by Carver Mead et. al.3 ,4. In the VLSI implementation of neural systems, it is convenient to build lateral inhibition networks by using a locally connected on-chip resistive grid. Linear resistors fabricated in, e.g., polysilicon, yield a very compact realization, and nonlinear resistive grids, made from MOS transistors, have been found useful for image segmentati...
    anchor: Neural network classifier using coding theory for improved classification capacity
    positive: INTRODUCTION Associative recall using neural networks has recently received a great deal of attention. Hopfield in his papers [1,2) deSCribes a mechanism which iterates through a feedback loop and stabilizes at the memory element that is nearest the input, provided that not many memory vectors are stored in the machine. He has also shown that the number of memories that can be stored in an N-neuron system is about O.15N for N between 30 and 100. McEliece et al. in their work (3) showed that for synchronous operation of the Hopfield memory about N (2IogN) data vectors can be stored reliably when N is large. Abu-Mostafa (4) has predicted that the upper bound for the number of data vectors in an N-neuron Hopfield machine is N. We believe that one should be able to devise a machine with M, the number of data vectors, linear in N and larger than the O.15N achieved by the Hopfield method. Figure 1 (a) Classification problems versus (b) Error control decoding problems In this paper we are spe...
  • Loss: CachedMultipleNegativesRankingLoss with these parameters:
    {
        "scale": 20.0,
        "similarity_fct": "cos_sim"
    }
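
As a rough illustration of what these two parameters control, the sketch below computes an in-batch-negatives ranking loss in pure Python. Each anchor is scored against every positive in the batch with cosine similarity, the scores are multiplied by `scale`, and cross-entropy is taken with the matching positive as the target. This is a simplified sketch of the loss value only. It does not reproduce the gradient caching that the "Cached" variant uses to fit large batches in memory, and the function names are illustrative.

```python
import math

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def multiple_negatives_ranking_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where positives[i] is the target for anchors[i]."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cos_sim(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Hypothetical toy batch: each anchor matches its own positive exactly.
toy_anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = multiple_negatives_ranking_loss(toy_anchors, toy_anchors)
# Same batch with the positives swapped, so every pairing is wrong.
swapped_loss = multiple_negatives_ranking_loss(toy_anchors, toy_anchors[::-1])
```

A larger `scale` sharpens the softmax, so well-separated pairs drive the loss toward zero faster, while mismatched pairs are penalized more heavily.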
    

Training Hyperparameters

Non-Default Hyperparameters

  • eval_strategy: steps
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 500
  • learning_rate: 2e-05
  • num_train_epochs: 1
  • warmup_ratio: 0.01
  • bf16: True
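
To make the learning-rate settings concrete, here is a minimal sketch of a linear schedule with warmup, using `learning_rate: 2e-05` and `warmup_ratio: 0.01`. It assumes a total of 112 optimizer steps, which is the count implied by 14,255 samples at a batch size of 128 on a single device, and it assumes the warmup step count is rounded up. The function name is hypothetical and the trainer's own scheduler may differ in detail.

```python
import math

def linear_schedule_lr(step, total_steps, base_lr=2e-5, warmup_ratio=0.01):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    # Assumption: the number of warmup steps is the ratio rounded up.
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = total_steps - step
    return base_lr * max(0.0, remaining / max(1, total_steps - warmup_steps))

# Assumed total number of steps for the single training epoch.
TOTAL_STEPS = 112
schedule = [linear_schedule_lr(s, TOTAL_STEPS) for s in range(TOTAL_STEPS + 1)]
```

With such a short warmup, the rate reaches its peak within the first couple of steps and then decays for almost the whole epoch.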

All Hyperparameters

  • overwrite_output_dir: False
  • do_predict: False
  • eval_strategy: steps
  • prediction_loss_only: True
  • per_device_train_batch_size: 128
  • per_device_eval_batch_size: 500
  • per_gpu_train_batch_size: None
  • per_gpu_eval_batch_size: None
  • gradient_accumulation_steps: 1
  • eval_accumulation_steps: None
  • torch_empty_cache_steps: None
  • learning_rate: 2e-05
  • weight_decay: 0.0
  • adam_beta1: 0.9
  • adam_beta2: 0.999
  • adam_epsilon: 1e-08
  • max_grad_norm: 1.0
  • num_train_epochs: 1
  • max_steps: -1
  • lr_scheduler_type: linear
  • lr_scheduler_kwargs: {}
  • warmup_ratio: 0.01
  • warmup_steps: 0
  • log_level: passive
  • log_level_replica: warning
  • log_on_each_node: True
  • logging_nan_inf_filter: True
  • save_safetensors: True
  • save_on_each_node: False
  • save_only_model: False
  • restore_callback_states_from_checkpoint: False
  • no_cuda: False
  • use_cpu: False
  • use_mps_device: False
  • seed: 42
  • data_seed: None
  • jit_mode_eval: False
  • use_ipex: False
  • bf16: True
  • fp16: False
  • fp16_opt_level: O1
  • half_precision_backend: auto
  • bf16_full_eval: False
  • fp16_full_eval: False
  • tf32: None
  • local_rank: 0
  • ddp_backend: None
  • tpu_num_cores: None
  • tpu_metrics_debug: False
  • debug: []
  • dataloader_drop_last: False
  • dataloader_num_workers: 0
  • dataloader_prefetch_factor: None
  • past_index: -1
  • disable_tqdm: False
  • remove_unused_columns: True
  • label_names: None
  • load_best_model_at_end: False
  • ignore_data_skip: False
  • fsdp: []
  • fsdp_min_num_params: 0
  • fsdp_config: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
  • tp_size: 0
  • fsdp_transformer_layer_cls_to_wrap: None
  • accelerator_config: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
  • deepspeed: None
  • label_smoothing_factor: 0.0
  • optim: adamw_torch
  • optim_args: None
  • adafactor: False
  • group_by_length: False
  • length_column_name: length
  • ddp_find_unused_parameters: None
  • ddp_bucket_cap_mb: None
  • ddp_broadcast_buffers: False
  • dataloader_pin_memory: True
  • dataloader_persistent_workers: False
  • skip_memory_metrics: True
  • use_legacy_prediction_loop: False
  • push_to_hub: False
  • resume_from_checkpoint: None
  • hub_model_id: None
  • hub_strategy: every_save
  • hub_private_repo: None
  • hub_always_push: False
  • gradient_checkpointing: False
  • gradient_checkpointing_kwargs: None
  • include_inputs_for_metrics: False
  • include_for_metrics: []
  • eval_do_concat_batches: True
  • fp16_backend: auto
  • push_to_hub_model_id: None
  • push_to_hub_organization: None
  • mp_parameters:
  • auto_find_batch_size: False
  • full_determinism: False
  • torchdynamo: None
  • ray_scope: last
  • ddp_timeout: 1800
  • torch_compile: False
  • torch_compile_backend: None
  • torch_compile_mode: None
  • dispatch_batches: None
  • split_batches: None
  • include_tokens_per_second: False
  • include_num_input_tokens_seen: False
  • neftune_noise_alpha: None
  • optim_target_modules: None
  • batch_eval_metrics: False
  • eval_on_start: False
  • use_liger_kernel: False
  • eval_use_gather_object: False
  • average_tokens_across_devices: False
  • prompts: None
  • batch_sampler: batch_sampler
  • multi_dataset_batch_sampler: proportional

Training Logs

Epoch    Step   Training Loss   cosine_ndcg@10
0.0893   10     0.5247          0.8247
0.1786   20     0.2625          0.8446
0.2679   30     0.2159          0.8485
0.3571   40     0.1849          0.8487
0.4464   50     0.2149          0.8506
0.5357   60     0.1538          0.8534
0.625    70     0.1617          0.8547
0.7143   80     0.1463          0.8575
0.8036   90     0.1626          0.8592
0.8929   100    0.1334          0.8598
0.9821   110    0.168           0.8603
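
The Epoch column can be cross-checked against the dataset size and batch size. The sketch below assumes a single device, `gradient_accumulation_steps: 1`, and `dataloader_drop_last: False`, so the final partial batch still counts as a step. The helper names are illustrative.

```python
import math

def steps_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of optimizer steps in one pass over the data."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

def epoch_at_step(step, num_samples, batch_size):
    """Fraction of an epoch completed after a given number of steps."""
    return round(step / steps_per_epoch(num_samples, batch_size), 4)

# Values taken from the card: 14,255 training samples, batch size 128.
NUM_SAMPLES = 14255
BATCH_SIZE = 128
logged_epochs = {
    step: epoch_at_step(step, NUM_SAMPLES, BATCH_SIZE) for step in (10, 70, 110)
}
```

Under these assumptions the computed fractions line up with the logged Epoch values for the corresponding steps.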

Framework Versions

  • Python: 3.12.9
  • Sentence Transformers: 3.4.1
  • Transformers: 4.50.0
  • PyTorch: 2.5.1
  • Accelerate: 1.5.2
  • Datasets: 3.4.1
  • Tokenizers: 0.21.1

Citation

BibTeX

Sentence Transformers

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

CachedMultipleNegativesRankingLoss

@misc{gao2021scaling,
    title={Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup},
    author={Luyu Gao and Yunyi Zhang and Jiawei Han and Jamie Callan},
    year={2021},
    eprint={2101.06983},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}

Model size: 434M params (Safetensors)
Tensor type: F32

Model: Daria-best/stella_en_400M_v5_neurips_papers_fine-tuned (fine-tuned from NovaSearch/stella_en_400M_v5)

Evaluation results