如何使用环境监视查看原始日志 / How to use Environment Monitoring View Raw Logs-ww12345678 的部落格

This document explains how to use the "view raw logs" feature in LCS environment monitoring for your Cloud Dynamics 365 for Finance and Operations environments, this is the ability for you to look at some of the various telemetry data we record from your environments (for example slow queries) to give you insight into issues you might have, or crucially to react proactively before anyone notices there's an issue.

So what is this view raw logs?

Physically "view raw logs" is a button in LCS which shows you various telemetry data taken from your environment, things like long running queries. In the background this is surfacing telemetry data gathered from the environment - for all Microsoft-hosted cloud environments we're gathering telemetry data constantly. This is via instrumentation in our application, we are gathering a huge number of rows per hour from a busy environment. We store this in a Big Data solution in the cloud, this is more than just a SQL Database somewhere, as we are quite literally gathering billions and billions of rows per day, it's a pretty special system.

Timings - how quickly does it show in LCS and long is it kept for?

There is approximately a 10 minute delay between capturing this data from an environment and being able to view it in LCS. Data is available in LCS for 30 days - so you always have a rolling last 30 days.

A few limitations/frequently asked questions

- Is it available for on-premises? Not available for on-premises and not on the roadmap yet. This feature relies on uploading telemetry data to Microsoft cloud, so it doesn't feel right for on-premises. - Is it available for ALL other environments? It's available for environments within your Implementation project - so Tier1-5 environments and Production. It's not available for environments you download or host on your own Azure subscription. - Doesn't Microsoft monitor and fix everything for me so I don't need to look at anything? This can be a sensitive subject; Microsoft are monitoring production, and will contact you if they notice an issue which you need to resolve (that's quite new). Customer/Partner still own their code, and Microsoft won't change your code for you. During implementation and testing, you're trying to make sure all is good before you go-live, this is useful during that period too. So the reality is it's a little bit on all parties. - Is there business data shown/stored in telemetry? No. From a technical perspective this does mean things like user name and infolog messages are not shown, which as a Developer is annoying, but understandable.

Where to find view raw logs?

From your LCS project, click on "Full details" next to the environment you want to view, a new page opens, then scroll to the bottom of the page and click "Environment monitoring" link, a new page opens, click "view raw logs" button (towards the right hand side), now you're on the View raw logs page! Here's a walkthrough:

Explanation of the search fields

See below:

How to use "search terms" for a query?

This field allows you to search for any value in any column in the report. A common example would be looking for an Activity ID from an error message you get, for example:

An activity ID can be thought of as the ID which links together the log entries for a particular set of actions a user was taking – like confirming a PO. If you add a filter on this in the “All logs” query, as below, then you’ll see all logs for the current activity the user was performing – this is showing all events tagged with that activityId.

Tell me what each query does!

All logs

This query can be used to view all events for a giving user’s activity ID. If a user had a problem and saved the activity ID for you, then you can add it in the “search terms” box in this query and see all events for the process they were performing at the time. The exceptionMessage and exceptionStacktrace are useful for a Developer to understand what may have caused a user’s issue, these are populated when the TaskName= AosXppUnhandledException

All error events

This is a version of “All logs” which is filtered to only show TaskName=CliError, which means Infolog errors. There is only one column available on this report which isn’t already available in “All logs” which is eventGroupId, which serves no practical purpose. It is not possible to identify which users had the errors (user isn't captured directly on this TaskName). It is not possible to see the Infolog message shown to the user (because it could have contained business data so can't be captured). The callstack column shows the code call stack leading to the error.

User login events

This shows when user sessions logged on and off. The user IDs have been anonymized as GUIDs so to track them back to actual users, in the "Users" form inside Dynamics look at the "telemetry ID" field, on environments where you have database access you can look in the USERINFO table at OBJECTID. The report is pre-filtered to show 7 days activity from the end date of your choosing. There is a maximum limit of 10000 rows, the report isn’t usable if you have over 10k in 10 days. This report could be useful to make statistics about the number of unique users using the system per day/week/month, by dumping the results to excel and aggregating it.

Error events for a specific form

This shows all TaskName=CliError (Infolog errors) for a specific form name you search for. The form name is the AOT name of the form, e.g. TrvExpenses, not the name you see on the menu, e.g. Expenses. The call stack is visible for any errors found. This can be useful when users are reporting problems with a particular form, but they haven't given you an ActivtyId from an error message - using this query you can still find errors/call stacks related to the form.

Slow queries

This shows all slow queries for a time period. SQL query and call stack is shown. The Utilization column is the total estimated time (AvgExecutionTimeinSeconds * ExecutionCount) – we're calling it "estimated" because it’s using the average execution time and not the actual time. Queries over 100ms are shown. This one is one of my favourites, it's very useful to run after a set of testing has completed to see how query performance was. A Developer can easily see where long queries might be coming from (because SQL and call stack are given) and take action.

SQL Azure connection outages

Shows when SQL Azure was unavailable. This is very rare though, I've never seen it show any data.

Slow interactions

Ironically the "slow interactions" query takes a long time to run! The record limit isn’t respected on this query – it shows all records regardless. This means if you try to run it for longer periods it’ll fail with “query failed to execute” error message as the result set is too large, run it in small date ranges to prevent this. This one includes the slow query data I mentioned earlier, and also more form related information, so what this one can give you is a rough idea of the buttons a user pressed (or I should say form interactions to be more technically correct) leading up to the slow query. If you're investigating a slow query, looking for it here will give you a bit more context about the form interactions.

Is batch throttled

This shows whether batches were throttled. Batch throttling feature will prevent new batches from running temporarily if the resource limits set within the batch throttling feature are exceeded - this is to try and limit the resources that a batch process can use, to ensure that sufficient resources are available for user sessions. The infoMessage column in this report shows which resource was exceeded. Generally speaking you shouldn't hit the throttling limits - if you see data in here, it's likely you have a runaway batch job on your hands - find out which one and look at why it's going crazy.

Financial reporting daily error summary

Shows an aggregated summary of errors from the Financial Reporting processing service (used to be called Management Reporter). This gives you a fast view if anything is wrong with Financial Reporting, this is hard-coded to filter for today, but as processing runs every 5 minutes in the background that is ok. Typically use this if a user reports something is wrong/missing in Financial Reporting, to get a quick look at if any errors are being reported there.

Financial reporting long running queries

This shouldn't return any data normally - it might do if a reset has been performed on Financial reporting and it's rebuilding all of it's data. Generally for customers and partners I would recommend not to worry about this one, it's more for Microsoft's benefit.

Financial reporting SQL query failures

Again this one shouldn't return data normally. This helps to catch issues such as, when copying databases between environments if change tracking has been re-enabled, then Financial reporting will be throwing errors when it tries to make queries against change tracking.

Financial reporting maintenance task heartbeat

The Financial reporting service reports a heartbeat once a minute to telemetry to prove it's running ok. This report shows that data summarized - so it has 1 row per hour, and should show 60 count for each row (i.e. one per minute). This allows you to see if the service is running and available. Note that the report doesn't respect the row limit, but as it's aggregated it doesn't cause a problem.

Financial reporting data flow

For those of you familiar with the old versions of Financial Reporting (or management reporter), this is similar to the output you used to get in the UI of the server app, where you can see the various integration tasks and whether they ran ok, and how many records they processed. This is useful for checking if the integration is running correctly or if one of the jobs is failing. Note that this query also ignores the row limit, so run it for a shorter time period or it'll run for a long time.

Financial reporting failed integration records

I'd skip this one, it's showing just the timestamp and name for each integration task (similar to the "Financial reporting data flow" query above, but with less information), the name suggests it shows only failures, but actually it shows all rows regardless. Use the "Financial reporting data flow" query instead.

All events for activity

You can skip over this one - it's very similar to the "All logs" query – but it also has SQL server name and SQL database name, which are irrelevant as you’ve already chosen an environment to view it so you know which server and database it is.

All crashes

This shows AOS crashes, it tells you is how many crashes there were but it’s not directly actionable from here. If you have data here, log a support ticket with Microsoft - on the Microsoft side we have more data available about the crash which means it is actionable. Microsoft are proactively investigating crash issues we see through telemetry. Keeping up to date on platform updates helps prevent seeing crashes.

All deadlocks in the system

The title of this query is odd "in the system", ahh thanks for clarifying I thought it was all deadlocks in the world. This shows SQL deadlocks, and gives the SQL statement and call stack. You can use this similarly to the "Slow queries" query, for example after a round of testing has completed, review this log to check whether the tested code was generating deadlocks anywhere - and if it is then investigate the related X++ code.

Error events for activity

This is a filtered versions of the query “All events for activity” showing only errors, which itself is a version of "All logs" - it means if you've been given an ActivityId you could use this one to jump straight to only the error events relating to that activity - whereas "All logs" would show you errors + other data.

Distinct user sessions

This one shows, for each user, how many sessions they’ve had during a time period. You could use this to look at user adoption of the environment - the number of unique users per day/week/month - see if users are actually using it. It is similar to "User login events", just aggregated.

All events for user

This one is named in a confusing way – really it is showing user interaction events for a user – so it’ll show you everything a user pressed in forms in the time period. The tricky thing is that user IDs are obfuscated so you need to find the GUID for the user first – look it up in the "Users" form inside Dynamics. You might use this to see what a particular user was doing during a period, if you're trying to reproduce something they've reported and the user isn't being very forthcoming with information. The information shown here is a little difficult to interpret, it's very much Developer focused.

All events for browser session

This allows you to look up results using the session ID from an error message - remember right at the beginning of this article the screenshot about how to use the ActivityId from an error message - well also in that message was a "Session ID" this query let's you show logs for the session. Imagine an Activity ID is a set of related events within a session, and a Session ID is the overarching session containing everything while the user was logged in that time. Find the official page on Monitoring and Diagnostics here.

如何使用环境监视查看原始日志 / How to use Environment Monitoring View Raw Logs