By Alok Mehta and Gopi Katari
In today’s world, all organizations place significant emphasis on drawing insights from data collected from varied systems, both internal and external. These insights give companies an edge in making better decisions and conducting business. An important part of collecting and curating that data falls to the Enterprise Content Management (ECM) team, which has the immense responsibility of collecting both structured and unstructured content from a wide range of sources and storing high-volume content along with its metadata. However, the responsibility does not stop there. The content that has been collected and stored holds a wealth of information that can be used for analytics, including Artificial Intelligence (AI) and machine learning. This article will touch upon best practices for curating content and how to make it ready for consumption by reporting and analytics.
Before we do a deep dive, let’s define the different types of data that an ECM system manages:
- Structured data: refers to data that is organized in a pre-defined format. For example, data stored in a relational database.
- Unstructured data: data that does not have a pre-defined format. Examples of unstructured data include text documents, audio and video files, and social media posts.
- Hybrid (semi-structured) data: a combination of both structured and unstructured data. For example, a questionnaire with both multiple-choice and free-text answers.
Compared to semi-structured and unstructured data, structured data poses fewer challenges for storage and analytical reporting. When dealing with unstructured data, there are many considerations and challenges that need to be managed:
- Format: Because unstructured data does not have a pre-defined format, it can be difficult to organize and search. This can make it challenging to extract and analyze the data.
- Volume: Unstructured data can often be very large in volume, which can make it difficult to store and process. This can require the use of specialized tools and infrastructure to handle the data.
- Heterogeneity: Unstructured data can come in many different forms and formats, which can make it difficult to analyze and integrate with other data sources. This can require the use of advanced data processing techniques to extract and normalize the data.
- Quality: Unstructured data can often be noisy or incomplete, which can make it difficult to extract accurate and reliable insights. This can require the use of advanced data cleaning and preprocessing techniques to improve the quality of the data.
While format, volume, heterogeneity, and quality play a role, the best practices remain consistent across all three types of data: structured, unstructured, and semi-structured. The following techniques will help enhance the content data and can be very helpful for the teams consuming data from the ECM:
Centralization: The standard norm across IT has been to centralize the content management repository. But many large, diversified companies still run hybrids, with some content stored across various applications. Apart from the cost benefit of having a centralized content repository, there are many other benefits. It brings consistency to the content, the tools being used, and the formats across all departments. Maintenance is simplified, and all departments can reuse the same content multiple times, avoiding duplication of content across many applications. A centralized content repository also makes life easier for data engineers: instead of extracting data from multiple source applications, they can focus on one system. If data engineers have to extract data from multiple sources, they have the extra task of mapping the data across all systems, which opens the possibility of more errors and makes reports more prone to being incorrect.
Data Quality: Take steps to ensure a high level of quality in the data and the content. There should be good standards for data definition and data quality rules established in the organization. Further, a continuous improvement process should be established. Regular monitoring, quality checks, and addressing quality issues as soon as possible should be part of the organization’s culture. The users who index the data must be trained to provide high-quality data before storing it in any system. This could be a long process, but it will pay high dividends in the long run with high-quality data and will be useful for other downstream systems.
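One way to make data quality rules concrete and enforceable is to check each metadata record against them at indexing time. The sketch below assumes hypothetical field names and rules (policy_number, doc_type, scan_date); real rules would come from the organization's data definition standards:

```python
# Minimal sketch of rule-based metadata quality checks; the field
# names and rules are hypothetical examples, not a standard.
import re
from datetime import datetime

def _is_iso_date(value):
    """True if the value parses as an ISO (YYYY-MM-DD) date."""
    try:
        datetime.strptime(value or "", "%Y-%m-%d")
        return True
    except ValueError:
        return False

RULES = {
    "policy_number": lambda v: bool(re.fullmatch(r"POL-\d{8}", v or "")),
    "doc_type": lambda v: v in {"claim", "policy", "correspondence"},
    "scan_date": _is_iso_date,
}

def quality_issues(record):
    """Return the names of the fields that fail their quality rule."""
    return [field for field, rule in RULES.items() if not rule(record.get(field))]

# All three fields below fail their rules (short number, unknown
# type, empty date), so they would be flagged before storage.
issues = quality_issues({"policy_number": "POL-123", "doc_type": "memo", "scan_date": ""})
```

Running such checks before content is stored catches issues when they are cheapest to fix, rather than after downstream systems have consumed the data.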
Data Enhancement: The structured and unstructured data stored in the ECM system need to be cleansed, enriched, and verified before being handed over to the analytics and AI teams. This will improve the accuracy and reliability of the data, making it useful for all applications consuming it. The information stored in unstructured data can be extracted using different tools available in the market, such as content search services, Optical Character Recognition (OCR), Natural Language Processing (NLP), text mining, labeling, and multimedia analysis tools. These tools help extract the valuable information held in unstructured data and make it available for further analysis.
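As a minimal sketch of a cleanse-and-enrich step, the example below normalizes raw extracted text (for instance, noisy OCR output) and attaches simple derived metadata; a production pipeline would use the OCR/NLP tools mentioned above rather than these toy functions:

```python
# Minimal sketch of cleansing and enriching extracted text before it
# is handed to analytics teams; the derived fields are illustrative.
import re
import unicodedata

def cleanse(raw_text):
    """Normalize unicode, drop control characters, collapse whitespace."""
    text = unicodedata.normalize("NFKC", raw_text)
    text = "".join(ch for ch in text
                   if unicodedata.category(ch)[0] != "C" or ch in "\n\t ")
    return re.sub(r"\s+", " ", text).strip()

def enrich(text):
    """Attach simple derived metadata; real systems would use NLP tools."""
    return {
        "text": text,
        "word_count": len(text.split()),
        "mentions_claim": "claim" in text.lower(),
    }

doc = enrich(cleanse("  Claim\x0c form received   on 2023-05-01 "))
```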
Data Dictionary: A data dictionary is an important tool in ECM because it provides a central source of information about the data being stored and managed within an organization. It provides a detailed view of every attribute, such as the name, data type, and description of each data element. This information helps users understand the meaning and intended use of the data, and it can also help identify potential errors or inconsistencies. A data dictionary can also improve the efficiency and effectiveness of ECM systems by helping to ensure that data is used in the most appropriate and effective way, improving the overall quality of the information managed by the system.
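A data dictionary entry can be as simple as a structured record per field. The sketch below shows one possible shape; the attribute names and the POL_NUM example entry are illustrative, not a standard:

```python
# Minimal sketch of a data dictionary entry; the attributes shown
# (name, data_type, description, source_system) are illustrative.
from dataclasses import dataclass

@dataclass(frozen=True)
class DictionaryEntry:
    name: str
    data_type: str
    description: str
    source_system: str
    required: bool = False

data_dictionary = {
    "POL_NUM": DictionaryEntry(
        name="POL_NUM",
        data_type="string",
        description="Company-wide unique policy number",
        source_system="PolicyAdmin",
        required=True,
    ),
}

def describe(field):
    """Look up a field so consumers understand its meaning and use."""
    entry = data_dictionary[field]
    return f"{entry.name} ({entry.data_type}): {entry.description}"
```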
Security: Data and content are valuable assets for any company, and the ECM team should take the utmost care in protecting and securing them. Here are some steps the team can take:
- Make sure the data is encrypted at rest and in transit.
- Implement strong access control rules. Access should be given only to authorized users.
- Take regular full and incremental backups.
- ECM systems should be well protected with strict firewall rules.
- Regular monitoring and auditing processes should be implemented.
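The access-control bullet above can be sketched as a simple role-based check; the roles and permissions below are hypothetical examples, and a real ECM would enforce this in the platform's security layer:

```python
# Minimal sketch of role-based access checks for ECM content;
# the roles and permission sets are hypothetical examples.
ROLE_PERMISSIONS = {
    "viewer": {"read"},
    "indexer": {"read", "write"},
    "admin": {"read", "write", "delete"},
}

def is_authorized(role, action):
    """Grant access only when the role explicitly includes the action."""
    return action in ROLE_PERMISSIONS.get(role, set())
```

The deny-by-default lookup (an unknown role gets an empty permission set) reflects the principle that access is granted only to authorized users.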
Application Programming Interface (API): REST APIs are very useful when transferring data from one application to another. ECM teams have the option of using the native APIs provided by the platform or building custom APIs to meet the requirements of client applications. In general, ECM APIs provide load, search, and retrieval of documents, but they can also be used for purposes such as notifying other systems when a change happens in the ECM system, or bulk-uploading documents rather than uploading them in real time.
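As a sketch of how a client might call such APIs, the example below builds search and upload requests against a hypothetical ECM endpoint; the base URL, paths, and payload shape are assumptions, not any specific vendor's API:

```python
# Minimal sketch of calling a hypothetical ECM REST API; the base
# URL, endpoint paths, and payload shape are assumptions.
import base64
import json
import urllib.parse
import urllib.request

BASE_URL = "https://ecm.example.com/api/v1"

def build_search_request(field, value):
    """Construct (but do not send) a GET request searching by metadata."""
    query = urllib.parse.urlencode({field: value})
    return urllib.request.Request(f"{BASE_URL}/documents?{query}")

def build_upload_request(metadata, content_bytes):
    """Construct a POST request that loads a document with its metadata."""
    payload = json.dumps({
        "metadata": metadata,
        "content": base64.b64encode(content_bytes).decode("ascii"),
    }).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/documents",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# A real client would send these with urllib.request.urlopen(...).
req = build_search_request("POL_NUM", "POL-12345678")
```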
Data Mapping: Data maps help the organization understand how data flows across it. There are a few steps that need to be followed at the source system:
- Step 1: To have good data mapping between applications and systems, there must be company-level standards for each data element and a good classification of data, called a taxonomy. In the insurance industry, for example, the policy number should have the same unique representation, such as POL_NUM, across all systems company-wide. Systems should not each have a different representation for the policy number.
- Step 2: Metadata captured for all content is stored in the repository. Good metadata management procedures and policies will ensure good data quality. Metadata captured should ensure accuracy, completeness, interoperability, and consistency. A regular audit of the metadata is recommended. All mandatory fields need to be identified, and indexes should be created for regularly used search fields.
Data mapping is a cross-functional effort and needs accountability. All stakeholders need to collaborate while preparing a comprehensive data mapping that caters to legal, IT, security, and records management principles. Both data going out of the organization and data coming in from vendors need to be tracked for controlling the risks of any violations. A good data map should be able to capture these risks. Data mapping takes a generous amount of time and effort. There are multiple tools in the market that can help in the automation of mapping of data between the source and target system. The right tool will depend on each organization’s goals and budget.
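The field-level side of Step 1 can be sketched as a lookup table that renames each source system's fields onto the company-standard names; the system and field names below are illustrative:

```python
# Minimal sketch of mapping source-system field names onto
# company-standard names (e.g. POL_NUM); mappings are illustrative.
FIELD_MAP = {
    "claims_app": {"polNo": "POL_NUM", "dt": "SCAN_DATE"},
    "billing_app": {"policy_id": "POL_NUM", "scanned": "SCAN_DATE"},
}

def to_standard(system, record):
    """Rename a source record's fields to the company-wide standard."""
    mapping = FIELD_MAP[system]
    return {mapping.get(key, key): value for key, value in record.items()}
```

Commercial mapping tools automate the discovery and maintenance of tables like this across many systems; the right tool will depend on each organization’s goals and budget.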
Compliance: All organizations must comply with the regulations and rules relevant to their industry. Some content needs to be retained due to legal holds, and some content can be purged once the retention criteria are met. Data also needs to be anonymized and secured based on where it is stored and its sensitivity. Once all the rules and regulations are understood, develop a plan to ensure they are being followed to avoid any breach. The data analytics and AI teams need to be provided with the most current information. Numerous tools for data classification, data encryption, auditing/monitoring, and compliance management can help keep the data prepared.
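One common anonymization technique is to replace sensitive values with a one-way hash before data leaves the ECM, so records stay joinable without exposing the raw values. The sketch below assumes a hypothetical list of sensitive fields (a real deployment would also add a secret salt and follow its regulatory guidance):

```python
# Minimal sketch of anonymizing sensitive fields; the field list is a
# hypothetical example, and a real system would add a secret salt.
import hashlib

SENSITIVE_FIELDS = {"ssn", "email"}

def anonymize(record):
    """Replace sensitive values with a truncated one-way hash so the
    same input always maps to the same token (records stay joinable)."""
    return {
        key: hashlib.sha256(str(value).encode()).hexdigest()[:12]
        if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }
```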
Retention: Retention is an important concept in the field of ECM, as it refers to the length of time that data and other content are kept within the ECM system. The retention period is typically determined by a set of rules or policies established by the organization, and it can vary depending on the type of content, its business value, and other factors. Retention should be applied to comply with legal and regulatory requirements. It is also needed to protect against data loss or damage and to preserve the value and integrity of the data.
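A retention rule can be sketched as a simple eligibility check per document type; the retention periods below are illustrative, and legal holds always override purging:

```python
# Minimal sketch of a retention check; retention periods per document
# type are illustrative, and a legal hold always blocks purging.
from datetime import date, timedelta

RETENTION_YEARS = {"claim": 7, "correspondence": 3}

def is_purgeable(doc_type, stored_on, on_legal_hold, today=None):
    """A document may be purged only once its retention period has
    elapsed and it is not under a legal hold. (365-day years are an
    approximation; real policies would use calendar arithmetic.)"""
    today = today or date.today()
    cutoff = stored_on + timedelta(days=365 * RETENTION_YEARS[doc_type])
    return not on_legal_hold and today >= cutoff
```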
Performance: The ECM system should provide fast and efficient access to data and content. Service-level agreements (SLAs) should be established for all operations and for each business unit. Any change to the ECM system should trigger updates to the performance scripts, which should then be run to verify the change. Regular performance testing improves the user experience, enhances system capabilities, and reduces cost to the company by maintaining a stable system.
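The core of such a performance script is timing an operation against its SLA. The sketch below uses a hypothetical 200 ms threshold; real scripts would time actual ECM retrieval calls and report per business unit:

```python
# Minimal sketch of verifying an operation against an SLA; the 200 ms
# threshold and the timed operation are hypothetical examples.
import time

SLA_SECONDS = 0.200

def within_sla(operation, *args):
    """Run an operation and report whether it met the SLA, with timing."""
    start = time.perf_counter()
    operation(*args)
    elapsed = time.perf_counter() - start
    return elapsed <= SLA_SECONDS, elapsed

# In practice, `operation` would be a document search or retrieval call.
ok, elapsed = within_sla(lambda: sum(range(1000)))
```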
Now that we have the data prepared and enhanced, let us delve into the analytics that can be performed on structured, unstructured, and hybrid data. Several types of reports can be generated from structured data, such as summary reports, trend analyses, predictive reports, and productivity reports, along with visualizations. Here are examples of those kinds of reports:
- AI and machine learning can surface insights into trends, support predictive analysis, and enable continuous self-learning.
- Sales reports, which show how much a company is selling, the average price of its products, and its most popular products or services.
- Marketing reports, which show how effective a company’s marketing campaigns are at driving traffic and engagement.
- Customer service reports, which show how well a company is responding to customer inquiries and issues.
- Financial reports, which show a company’s financial performance, including its revenue, expenses, and profit or loss.
- Inventory reports, which show how much stock a company has on hand and how quickly it is being sold.
- Supply chain reports, which show the flow of goods and materials through a company’s supply chain.
Generating reports from unstructured data can be much more challenging, but those challenges can be reduced by following the practices listed above. Here is a list of analytics that can be performed on unstructured data:
- Text analytics, which involves extracting and analyzing the content of text data, such as customer reviews or social media posts.
- Network analysis, which analyzes the connections and relationships between entities in a dataset, such as the connections between people in a social network.
- Topic modeling, which identifies the main topics or themes in a collection of text documents.
- Sentiment analysis, which extracts and analyzes the sentiment or emotion expressed in text data.
- Image and video analysis, which analyzes the content of images and videos, such as identifying objects or people in a video.
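Two of the analyses above can be sketched in a few lines on raw text: term frequency as a crude stand-in for topic modeling, and lexicon-based sentiment scoring. The word lists below are tiny illustrative samples; production systems would use NLP libraries and much richer lexicons:

```python
# Minimal sketch of term-frequency and lexicon-based sentiment
# analysis; the positive/negative word lists are illustrative samples.
import re
from collections import Counter

POSITIVE = {"great", "helpful", "fast"}
NEGATIVE = {"slow", "broken", "poor"}

def tokenize(text):
    """Lowercase and split text into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def top_terms(texts, n=3):
    """Most frequent terms across a collection of documents."""
    counts = Counter(token for text in texts for token in tokenize(text))
    return [term for term, _ in counts.most_common(n)]

def sentiment(text):
    """Positive minus negative word count; > 0 leans positive."""
    tokens = tokenize(text)
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
```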
The best practices listed above cover some of the major points to consider, but there will be many more aspects, not covered here, that are pertinent and vital to your organization. The goal here is to make you think about and utilize the information stored in your content repository. In this competitive world, knowledge is key. We believe these best practices will position your ECM for better reporting and analytics across your organization.
Alok Mehta is CIO Business Systems at Kemper: https://www.linkedin.com/in/dralokmehta/
Gopi Katari is Sr. Manager at Kemper: https://www.linkedin.com/in/gopalakrishnan-katari-aa338714/