In recent years, human activity recognition has attracted considerable attention owing to the importance of understanding what a person is doing during normal daily activities. A correct understanding of human activities also plays a fundamental role in the field of Human-Robot Interaction, since robots can collaborate with humans and help people perform their actions. Activity recognition can be achieved by exploiting different kinds of sensors, such as inertial or visual ones; however, no single sensor modality can cope with all the situations that occur in the real world. The VISTA dataset represents a unique compromise, for the recognition of activities of daily living, between datasets based on visual sensors and those based on inertial ones. In particular, the unique values of the VISTA dataset are:

- Daily actions and activities. For the selection of the VISTA actions we referred to the Cornell Activity Dataset (CAD-60) and the MSR Daily Activity 3D Dataset. Ten actions common to the two datasets were selected and then combined into five activities (i.e., scenes) of daily living.

- Multimodal data. The VISTA dataset combines RGB-D cameras with inertial sensors positioned on the fingers and the wrist of the subjects. This allows the creation of a complete and heterogeneous dataset in which the gross movements of the body are captured by the visual sensors and the fine ones by the inertial sensors (a minimal synchronization sketch is given below).

- Multiple views. One drawback of a vision system is the occlusion that can occur when the subject does not face the camera. The VISTA dataset provides simultaneous video recordings from two perspectives, frontal and lateral, with the purpose of investigating whether a multimodal approach can improve recognition accuracy when the subject is not in a frontal position, thus mimicking more realistic operating conditions (see the fusion sketch below).

- Human-Robot Interaction. The VISTA dataset was acquired during interaction with the Pepper robot: the participants performed the actions that the robot explained and requested.

We envisage several applications in which our dataset can be used: action and scene recognition, analysis of the transitions between actions, development of customized machine learning algorithms, and comparison of approaches using different sensor modalities, to name a few. The implications of such research efforts are important not only in the field of activity recognition but also in Human-Robot Interaction, since understanding actions and context can improve a robot's perception abilities.
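As referenced in the multimodal-data point above, a natural first step when working with combined camera and inertial streams is temporal alignment. The following is a minimal sketch of nearest-neighbour alignment of IMU samples to video frame timestamps; the function name, sampling rates, and channel layout are illustrative assumptions, not the actual VISTA file format.

```python
# Minimal sketch: align inertial samples to RGB-D frame timestamps.
# Rates, shapes, and names below are illustrative assumptions, not the
# actual VISTA file layout.
import numpy as np

def align_imu_to_frames(frame_ts, imu_ts, imu_samples):
    """For each camera frame timestamp, pick the nearest IMU sample.

    frame_ts:    (F,) array of frame timestamps in seconds
    imu_ts:      (S,) sorted array of IMU timestamps in seconds
    imu_samples: (S, C) array of IMU channels (e.g., accel + gyro)
    Returns an (F, C) array of IMU readings aligned to the video stream.
    """
    # searchsorted gives the insertion index; compare against the
    # previous sample to keep whichever neighbour is closer in time.
    idx = np.searchsorted(imu_ts, frame_ts)
    idx = np.clip(idx, 1, len(imu_ts) - 1)
    prev_closer = (frame_ts - imu_ts[idx - 1]) < (imu_ts[idx] - frame_ts)
    idx[prev_closer] -= 1
    return imu_samples[idx]

# Toy usage: a 30 fps camera against a 100 Hz wrist IMU.
frames = np.arange(0, 2, 1 / 30)          # 2 s of video frames
imu_t = np.arange(0, 2, 1 / 100)          # 2 s of IMU timestamps
imu_x = np.random.randn(len(imu_t), 6)    # 6 channels: accel + gyro
aligned = align_imu_to_frames(frames, imu_t, imu_x)
print(aligned.shape)                      # (60, 6)
```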
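For the multiple-views point, one simple way to exploit the two perspectives is score-level (late) fusion of per-view classifiers. The sketch below averages class probabilities with a tunable view weight; the class count, inputs, and fusion rule are assumptions, since the dataset description does not prescribe any particular model or combination strategy.

```python
# Minimal sketch: score-level (late) fusion across the two camera views.
# The inputs and weighting rule are placeholders, not a prescribed method.
import numpy as np

def fuse_views(p_frontal, p_lateral, w_frontal=0.5):
    """Weighted average of per-view class probabilities.

    p_frontal, p_lateral: (N, K) softmax outputs from each view's model.
    When the subject is occluded in one view, lowering that view's
    weight lets the other view dominate the decision.
    """
    p = w_frontal * p_frontal + (1.0 - w_frontal) * p_lateral
    return p.argmax(axis=1)

# Toy usage with 3 clips and 10 action classes.
rng = np.random.default_rng(0)
pf = rng.dirichlet(np.ones(10), size=3)   # frontal-view scores
pl = rng.dirichlet(np.ones(10), size=3)   # lateral-view scores
print(fuse_views(pf, pl))                 # fused class index per clip
```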