In data science projects, we use and analyze different types of data. In this article we shall cover the most common data types and their characteristics. This applies to data analytics, econometrics, statistical modeling and machine learning models.
Time Series Data: A time series is a series of data points indexed in time order. Most commonly, a time series is a sequence taken at successive, equally spaced points in time, which makes it a sequence of discrete-time data. It is also called temporal data.
Time series data can be collected and observed at many different frequencies (hourly, daily, weekly, monthly, quarterly, annually, etc.).
Some examples of time series data:
• Daily stock prices
• Monthly rainfall
• Quarterly sales
• Annual company profits
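As a minimal sketch of what "indexed in time order" and "equally spaced" mean in practice, the snippet below builds a small daily series using only the Python standard library. The prices and the helper `is_equally_spaced` are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta

# Hypothetical daily closing prices; the values are made up for illustration.
start = date(2021, 1, 4)
prices = [101.2, 102.5, 101.9, 103.4, 104.0]

# Index each observation by its date, one day apart.
series = [(start + timedelta(days=i), p) for i, p in enumerate(prices)]

def is_equally_spaced(ts):
    """Return True if consecutive timestamps are all the same distance apart."""
    gaps = {later[0] - earlier[0] for earlier, later in zip(ts, ts[1:])}
    return len(gaps) <= 1

for day, price in series:
    print(day.isoformat(), price)
print("equally spaced:", is_equally_spaced(series))
```

Dropping an observation from the middle of `series` would make the check return False, which is one quick way to spot gaps before modeling.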
Cross-sectional Data: Cross-sectional data consists of observations on one or more variables recorded at the same point in time, or over the same period, so that variation over time plays no role.
Examples of cross-sectional data:
• The gross annual income for each of 1000 randomly chosen households in Kolkata city, India for the year 2021.
• A list of grades scored by a class of students on a particular exam.
• Data collected on sales revenue, sales volume and number of customers for the last month at a particular coffee shop.
In time-series data, the ordering of the observations is important, while in cross-sectional data the ordering carries no information.
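The difference in how ordering matters can be illustrated with a small sketch. The numbers and the helper `first_differences` are assumptions made up for this example. Reordering a cross-section leaves its summary statistics unchanged, whereas reordering a time series changes the pattern it describes.

```python
from statistics import mean

# Hypothetical values, made up for illustration.
household_income = [42, 55, 38, 61, 47]   # cross-section: one value per household
monthly_rainfall = [10, 12, 15, 19, 24]   # time series: one value per month, in order

def first_differences(values):
    """Change from each observation to the next; only meaningful for ordered data."""
    return [b - a for a, b in zip(values, values[1:])]

# Reordering a cross-section leaves summary statistics unchanged.
reordered_income = list(reversed(household_income))
print(mean(household_income) == mean(reordered_income))

# Reordering a time series changes what it says: a steady rise becomes a fall.
print(first_differences(monthly_rainfall))
print(first_differences(list(reversed(monthly_rainfall))))
```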
Panel Data: Panel data combines cross-sectional and time series data. Here the same individuals (families, companies, persons, etc.) are observed at several points in time (days, years, before and after a treatment, etc.). This is also known as longitudinal data.
Example of panel data:
• Revenue, cost, number of employees and profit reported every quarter for the last 12 quarters for each company in a group of companies.
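One way to picture panel data is as a table keyed by both the individual and the time period. The sketch below assumes three hypothetical companies and made-up quarterly revenue figures. Fixing the company gives a time series, and fixing the quarter gives a cross-section. The helper names `time_series` and `cross_section` are illustrative only.

```python
# Hypothetical quarterly revenue, keyed by (company, quarter); values are made up.
panel = {
    ("A", "2021Q1"): 120, ("A", "2021Q2"): 125, ("A", "2021Q3"): 131,
    ("B", "2021Q1"): 80,  ("B", "2021Q2"): 78,  ("B", "2021Q3"): 85,
    ("C", "2021Q1"): 200, ("C", "2021Q2"): 210, ("C", "2021Q3"): 205,
}

def time_series(panel, company):
    """All periods for one company, in time order."""
    return [(q, v) for (c, q), v in sorted(panel.items()) if c == company]

def cross_section(panel, quarter):
    """All companies at one period."""
    return {c: v for (c, q), v in panel.items() if q == quarter}

print(time_series(panel, "B"))
print(cross_section(panel, "2021Q2"))
```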
In data science and machine learning, it is important to know the data type of the project, as it determines the chart used for visualization as well as the method used for analysis and modeling. For time-series data, a line chart is typically used to show trends and patterns, while for cross-sectional data a scatter plot or a histogram may be used to show the relationship between variables or the distribution of a variable, respectively.
For modeling and forecasting based on time-series data, methods such as exponential smoothing, ARIMA and ARIMAX are used, while other data types call for different methods. Hence it is important to know the dimensions of the data as well as the data type before embarking on any analytics, data science or AI project.
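To make one of these methods concrete, here is a minimal sketch of simple exponential smoothing written from scratch. It assumes the first level is initialised to the first observation, which is one common choice among several. The sales figures and the smoothing parameter `alpha=0.5` are made up. In a real project a statistical library would normally be used instead.

```python
def simple_exponential_smoothing(values, alpha):
    """Return the smoothed levels s_t = alpha * y_t + (1 - alpha) * s_{t-1}.

    The first level is initialised to the first observation (an assumption;
    other initialisations exist).
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not values:
        raise ValueError("values must not be empty")
    levels = [float(values[0])]
    for y in values[1:]:
        levels.append(alpha * y + (1 - alpha) * levels[-1])
    return levels

# Hypothetical quarterly sales, in time order.
quarterly_sales = [100, 110, 105, 120]
levels = simple_exponential_smoothing(quarterly_sales, alpha=0.5)
forecast = levels[-1]  # the one-step-ahead forecast is the last smoothed level
print(levels, forecast)
```

A larger `alpha` makes the forecast react faster to recent observations, while a smaller one smooths more heavily. Note that the calculation depends on the order of the observations, which is why it applies to time series and not to cross-sectional data.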